
Python notebooks are popular for both data analysis and data science. They're interactive, allowing you to write code, run it, and see the results in the same environment. But just like any code, your notebooks can become a mess if you're not careful. In this blog post, we'll cover some effective practices for writing clean, readable, and maintainable code in Python notebooks.
Use Markdown Cells for Documentation
Python notebooks are not just about code; they also let you incorporate Markdown cells. Use these to structure your notebook and add explanations, making it more understandable to others (or to your future self).
Example: Poor Documentation with Comments
# Load the dataset and perform basic cleaning
# Remove rows with missing values and convert date column
import pandas as pd
data = pd.read_csv('sales_data.csv')
data = data.dropna() # Remove missing values
data['date'] = pd.to_datetime(data['date']) # Convert to datetime
print(f"Dataset shape: {data.shape}")
Example: Better Documentation with Markdown
Instead, use a Markdown cell above your code:
Markdown Cell:
## Data Loading and Cleaning
In this section, we load our sales dataset and perform initial cleaning:
- Remove rows with missing values to ensure data quality
- Convert the date column to datetime format for time series analysis
- Display basic information about the cleaned dataset
Code Cell:
import pandas as pd
# Load and clean the dataset
data = pd.read_csv('sales_data.csv')
data = data.dropna()
data['date'] = pd.to_datetime(data['date'])
print(f"Dataset shape: {data.shape}")
This approach provides clear separation between documentation and code, making your notebook more readable and professional.
Break Down Complex Code into Multiple Cells
Python notebooks allow you to execute chunks of code independently. This feature can be used to improve readability and debug code more effectively. Rather than writing all your code in a single cell, you can break it down into smaller pieces, each accomplishing a specific task.
Example: Complex Code in Single Cell (Poor Practice)
# BAD: Everything in one cell
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# Load and preprocess data
data = pd.read_csv('customer_data.csv')
data = data.dropna()
data['age_group'] = pd.cut(data['age'], bins=[0, 25, 45, 65, 100], labels=['Young', 'Adult', 'Middle', 'Senior'])
data = pd.get_dummies(data, columns=['age_group', 'gender'])
X = data.drop('target', axis=1)
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model and evaluate
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
print(classification_report(y_test, predictions))
# Create visualization
feature_importance = pd.DataFrame({'feature': X.columns, 'importance': model.feature_importances_})
feature_importance = feature_importance.sort_values('importance', ascending=False)
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'][:10], feature_importance['importance'][:10])
plt.title('Top 10 Feature Importances')
plt.show()
Example: Breaking Down into Multiple Cells (Better Practice)
Cell 1: Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
Cell 2: Load and Clean Data
# Load the dataset
data = pd.read_csv('customer_data.csv')
print(f"Original dataset shape: {data.shape}")
# Remove missing values
data = data.dropna()
print(f"After removing NaN: {data.shape}")
Cell 3: Feature Engineering
# Create age groups
data['age_group'] = pd.cut(data['age'],
                           bins=[0, 25, 45, 65, 100],
                           labels=['Young', 'Adult', 'Middle', 'Senior'])
# Convert categorical variables to dummy variables
data = pd.get_dummies(data, columns=['age_group', 'gender'])
print(f"Features after encoding: {data.columns.tolist()}")
Cell 4: Prepare Training Data
# Separate features and target
X = data.drop('target', axis=1)
y = data['target']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
Cell 5: Train Model
# Initialize and train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Model training completed!")
Cell 6: Evaluate Model
# Make predictions and evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, predictions))
Cell 7: Visualize Feature Importance
# Create feature importance visualization
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
})
feature_importance = feature_importance.sort_values('importance', ascending=False)
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'][:10], feature_importance['importance'][:10])
plt.title('Top 10 Feature Importances')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()
Benefits of this approach:
- Each cell has a single, clear purpose
- Easy to debug individual steps
- Can re-run specific parts without rerunning everything
- Better readability and organization
Organize Your Notebook with Sections and Subsections
You can use Markdown cells to create sections and subsections, similar to how you would structure a standard report or document. This keeps your notebook organized and makes it easier for others to understand the flow of your analysis.
Example: Well-Organized Notebook Structure
Here's how you can structure a data analysis notebook using clear sections:
Markdown Cell - Main Title:
# Customer Churn Analysis Project
**Objective:** Analyze customer data to identify patterns and predict churn likelihood
**Dataset:** Customer transaction and demographic data (Jan 2023 - Dec 2023)
**Author:** Data Science Team
**Date:** 2024-01-15
Markdown Cell - Table of Contents:
## Table of Contents
1. [Data Import and Setup](#data-import)
2. [Exploratory Data Analysis](#eda)
   - 2.1. Data Overview
   - 2.2. Missing Values Analysis
   - 2.3. Feature Distributions
3. [Data Preprocessing](#preprocessing)
   - 3.1. Feature Engineering
   - 3.2. Data Cleaning
4. [Model Development](#modeling)
   - 4.1. Baseline Model
   - 4.2. Feature Selection
   - 4.3. Model Tuning
5. [Results and Conclusions](#results)
Markdown Cell - Section Header:
# 1. Data Import and Setup {#data-import}
In this section, we import necessary libraries and load our dataset.
Markdown Cell - Subsection:
## 2.1. Data Overview {#data-overview}
Let's examine the basic structure and characteristics of our dataset:
- Dataset dimensions
- Column data types
- Basic statistics
Markdown Cell - Analysis Results:
### Key Findings from EDA
From our exploratory analysis, we discovered:
1. **Missing Data**: 12% of records have missing income information
2. **Class Imbalance**: Only 23% of customers churned
3. **Key Patterns**:
   - Higher churn rate among customers with >3 support tickets
   - Customers with month-to-month contracts show 3x higher churn
   - Premium customers have significantly lower churn rates
**Next Steps**: Based on these findings, we'll focus on feature engineering around customer support interactions and contract types.
This organizational approach makes your notebook easier to navigate, easier to review, and far more useful to readers who only need a specific part of the analysis.
Avoid Hard-Coding Values
Hard-coding values in your code can lead to mistakes and make it harder to maintain. Instead, assign important values to variables at the beginning of your notebook.
Example: Hard-coded Values (Poor Practice)
# BAD: Hard-coded values scattered throughout the notebook
# Cell 1
data = pd.read_csv('customers_2023.csv')
# Cell 5
filtered_data = data[data['age'] >= 25]
# Cell 12
plt.figure(figsize=(12, 8))
plt.title('Customer Analysis 2023')
# Cell 18
train_test_split(X, y, test_size=0.2, random_state=42)
# Cell 25
model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
# Cell 30
high_value_customers = data[data['total_spent'] >= 1000]
Example: Using Configuration Variables (Better Practice)
# GOOD: Configuration section at the beginning
# =============================================================================
# CONFIGURATION PARAMETERS
# =============================================================================
# File paths
DATA_FILE = 'customers_2023.csv'
OUTPUT_DIR = 'results/'
MODEL_SAVE_PATH = 'models/customer_model.pkl'
# Analysis parameters
MIN_AGE = 25
MIN_SPENDING_THRESHOLD = 1000
ANALYSIS_YEAR = 2023
# Model parameters
TEST_SIZE = 0.2
RANDOM_STATE = 42
N_ESTIMATORS = 100
MAX_DEPTH = 10
# Visualization parameters
FIGURE_SIZE = (12, 8)
DPI = 300
COLOR_PALETTE = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']
# =============================================================================
# ANALYSIS CODE
# =============================================================================
# Cell 1: Load data
data = pd.read_csv(DATA_FILE)
print(f"Loaded data from {DATA_FILE}")
# Cell 2: Filter by age
filtered_data = data[data['age'] >= MIN_AGE]
print(f"Filtered to customers aged {MIN_AGE}+: {len(filtered_data)} records")
# Cell 3: Create visualization
plt.figure(figsize=FIGURE_SIZE, dpi=DPI)
plt.title(f'Customer Analysis {ANALYSIS_YEAR}')
# Cell 4: Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE
)
# Cell 5: Train model
model = RandomForestClassifier(
    n_estimators=N_ESTIMATORS,
    max_depth=MAX_DEPTH,
    random_state=RANDOM_STATE
)
# Cell 6: Identify high-value customers
high_value_customers = data[data['total_spent'] >= MIN_SPENDING_THRESHOLD]
print(f"High-value customers (${MIN_SPENDING_THRESHOLD}+): {len(high_value_customers)}")
Example: Using a Configuration Dictionary
# Alternative approach: Configuration dictionary
CONFIG = {
    'data': {
        'file_path': 'customers_2023.csv',
        'min_age': 25,
        'spending_threshold': 1000
    },
    'model': {
        'test_size': 0.2,
        'random_state': 42,
        'n_estimators': 100,
        'max_depth': 10
    },
    'visualization': {
        'figure_size': (12, 8),
        'color_palette': ['#1f77b4', '#ff7f0e', '#2ca02c']
    }
}
# Usage throughout the notebook
data = pd.read_csv(CONFIG['data']['file_path'])
filtered_data = data[data['age'] >= CONFIG['data']['min_age']]
model = RandomForestClassifier(
    n_estimators=CONFIG['model']['n_estimators'],
    max_depth=CONFIG['model']['max_depth'],
    random_state=CONFIG['model']['random_state']
)
Benefits of this approach:
- Easy maintenance: Change values in one place
- Better documentation: Clear parameter definitions
- Reproducibility: Consistent parameters across runs (see the sketch below for one way to record them)
- Flexibility: Easy to create different configurations for different scenarios
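Building on the reproducibility point, one option is to save the configuration next to your outputs so each run documents itself. Here is a minimal sketch, assuming the CONFIG dictionary above and an existing results/ directory (both are just examples); note that JSON stores tuples such as figure_size as lists:
import json
# Record the parameters used for this run alongside the outputs (path is an example)
with open('results/run_config.json', 'w') as f:
    json.dump(CONFIG, f, indent=2)
# Reload the same parameters later, or from another notebook
with open('results/run_config.json') as f:
    CONFIG = json.load(f)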
Include Error Handling
When your code encounters an error, it's helpful to know why. You can include error handling in your code to catch exceptions and provide helpful error messages.
Example: Basic Error Handling
import pandas as pd
import os
# Basic file loading with error handling
def load_data_safely(file_path):
    """Load data with proper error handling"""
    try:
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"Data file not found: {file_path}")
        data = pd.read_csv(file_path)
        print(f"Successfully loaded {len(data)} records from {file_path}")
        return data
    except FileNotFoundError as e:
        print(f"File Error: {e}")
        print("Please check the file path and ensure the file exists.")
        return None
    except pd.errors.EmptyDataError:
        print(f"Data Error: The file {file_path} is empty")
        return None
    except pd.errors.ParserError as e:
        print(f"Parse Error: Could not parse {file_path}")
        print(f"  Details: {e}")
        return None
    except Exception as e:
        print(f"Unexpected error loading {file_path}: {e}")
        return None

# Usage
data = load_data_safely('customer_data.csv')
if data is not None:
    print("Data loaded successfully, proceeding with analysis...")
else:
    print("Cannot proceed without data. Please fix the data loading issue.")
Example: Robust Data Processing with Error Handling
def process_customer_data(data):
    """Process customer data with comprehensive error handling"""
    if data is None or data.empty:
        raise ValueError("Cannot process empty or None data")
    try:
        # Check required columns
        required_columns = ['customer_id', 'age', 'total_spent', 'signup_date']
        missing_columns = [col for col in required_columns if col not in data.columns]
        if missing_columns:
            raise KeyError(f"Missing required columns: {missing_columns}")

        # Data type conversions with error handling
        processed_data = data.copy()

        # Convert age to numeric
        try:
            processed_data['age'] = pd.to_numeric(processed_data['age'], errors='coerce')
            invalid_ages = processed_data['age'].isna().sum()
            if invalid_ages > 0:
                print(f"Warning: {invalid_ages} invalid age values converted to NaN")
        except Exception as e:
            print(f"Error converting age column: {e}")

        # Convert date column
        try:
            processed_data['signup_date'] = pd.to_datetime(processed_data['signup_date'])
        except Exception as e:
            print(f"Error converting signup_date: {e}")
            print("Expected date format: YYYY-MM-DD")

        # Validate data ranges
        if (processed_data['age'] < 0).any():
            print("Warning: Found negative age values")
        if (processed_data['total_spent'] < 0).any():
            print("Warning: Found negative spending values")

        print("Data processing completed successfully")
        return processed_data

    except KeyError as e:
        print(f"Column Error: {e}")
        print(f"  Available columns: {list(data.columns)}")
        return None
    except Exception as e:
        print(f"Processing Error: {e}")
        return None

# Usage with error handling
try:
    processed_data = process_customer_data(data)
    if processed_data is not None:
        print(f"Processing complete. Final dataset shape: {processed_data.shape}")
    else:
        print("Data processing failed. Check the errors above.")
except Exception as e:
    print(f"Critical error in data processing: {e}")
Example: API Calls with Retry Logic
import requests
import time
from typing import Optional, Dict, Any

def fetch_data_with_retry(url: str, max_retries: int = 3, delay: float = 1.0) -> Optional[Dict[Any, Any]]:
    """
    Fetch data from API with retry logic and proper error handling
    """
    for attempt in range(max_retries):
        try:
            print(f"Attempt {attempt + 1}/{max_retries}: Fetching data from {url}")
            response = requests.get(url, timeout=30)
            response.raise_for_status()  # Raises an HTTPError for bad responses
            data = response.json()
            print(f"Successfully fetched data (attempt {attempt + 1})")
            return data
        except requests.exceptions.Timeout:
            print(f"Timeout on attempt {attempt + 1}")
        except requests.exceptions.ConnectionError:
            print(f"Connection error on attempt {attempt + 1}")
        except requests.exceptions.HTTPError as e:
            if response.status_code == 429:  # Rate limited
                print(f"Rate limited on attempt {attempt + 1}, waiting longer...")
                time.sleep(delay * 2)  # Wait longer for rate limits
            else:
                print(f"HTTP error {response.status_code} on attempt {attempt + 1}: {e}")
        except requests.exceptions.RequestException as e:
            print(f"Request error on attempt {attempt + 1}: {e}")
        except ValueError as e:  # JSON decode error
            print(f"JSON decode error on attempt {attempt + 1}: {e}")
        except Exception as e:
            print(f"Unexpected error on attempt {attempt + 1}: {e}")

        # Wait before retrying (except on last attempt)
        if attempt < max_retries - 1:
            print(f"Waiting {delay} seconds before retry...")
            time.sleep(delay)
            delay *= 1.5  # Exponential backoff

    print(f"Failed to fetch data after {max_retries} attempts")
    return None

# Usage
api_data = fetch_data_with_retry('https://api.example.com/customers')
if api_data:
    print("API data loaded successfully")
else:
    print("Using cached/backup data instead")
Benefits of proper error handling:
- Better user experience: Clear, helpful error messages
- Easier debugging: Specific information about what went wrong
- Graceful degradation: Notebook continues running even when some operations fail
- Production readiness: Robust code that handles edge cases
Regularly Restart the Kernel and Run All Cells
Over time, your notebook's state may become messy due to running cells out of order or modifying variables. To ensure that your code is reproducible and doesn't depend on any particular cell execution order, regularly restart the kernel and run all cells from the top. This can help you catch hidden bugs that might otherwise go unnoticed.
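If you want to go one step further, you can run the same check non-interactively. Here is a minimal sketch that executes a notebook top-to-bottom with jupyter nbconvert via subprocess; the notebook filenames are placeholders:
import subprocess
# Execute the notebook from top to bottom and write the executed copy to a new file
result = subprocess.run(
    [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--output", "analysis_checked.ipynb",
        "analysis.ipynb",
    ],
    capture_output=True,
    text=True,
)
if result.returncode == 0:
    print("Notebook runs cleanly from top to bottom.")
else:
    print("Execution failed:")
    print(result.stderr)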
Minimize the Output
While it's helpful to use print statements for debugging and understanding how your code works, it's a good practice to minimize the output when you're done. This doesn't mean you have to remove all print statements, but you should ensure that your code doesn't produce an overwhelming amount of output, which can make your notebook hard to navigate.
For example, if you're looping through a large list and printing the result for each iteration, consider removing the print statement or only printing for a subset of the iterations once you're done.
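A small sketch of this idea, using a hypothetical records list: report progress only every 1,000 iterations and print a single summary at the end.
records = list(range(10_000))
for i, record in enumerate(records):
    # ... process the record here ...
    if i % 1_000 == 0:
        print(f"Processed {i} of {len(records)} records...")
print(f"Done. Processed {len(records)} records in total.")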
Share Code Across Notebooks
Often, you'll find that you have a function or a class that could be useful in multiple notebooks. Instead of duplicating the code, consider putting it in a separate Python file and importing it. This makes your code more modular and easier to maintain.
Example: Creating Reusable Utility Functions
Step 1: Create a utility file (data_utils.py)
# data_utils.py
"""
Data processing utilities for customer analysis notebooks
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Any, Optional, Tuple

def load_and_validate_data(file_path: str, required_columns: List[str]) -> Optional[pd.DataFrame]:
    """
    Load CSV data and validate required columns exist

    Args:
        file_path: Path to the CSV file
        required_columns: List of column names that must be present

    Returns:
        DataFrame if successful, None if validation fails
    """
    try:
        data = pd.read_csv(file_path)
        missing_columns = [col for col in required_columns if col not in data.columns]
        if missing_columns:
            print(f"Missing required columns: {missing_columns}")
            return None
        print(f"Successfully loaded {len(data)} records with all required columns")
        return data
    except Exception as e:
        print(f"Error loading data: {e}")
        return None

def clean_customer_data(data: pd.DataFrame) -> pd.DataFrame:
    """
    Standard cleaning pipeline for customer data

    Args:
        data: Raw customer DataFrame

    Returns:
        Cleaned DataFrame
    """
    cleaned = data.copy()

    # Remove duplicates
    initial_rows = len(cleaned)
    cleaned = cleaned.drop_duplicates()
    duplicates_removed = initial_rows - len(cleaned)
    if duplicates_removed > 0:
        print(f"Removed {duplicates_removed} duplicate records")

    # Clean age column
    cleaned['age'] = pd.to_numeric(cleaned['age'], errors='coerce')
    cleaned = cleaned[cleaned['age'].between(0, 120)]

    # Clean spending data
    cleaned['total_spent'] = pd.to_numeric(cleaned['total_spent'], errors='coerce')
    cleaned = cleaned[cleaned['total_spent'] >= 0]

    # Convert dates
    if 'signup_date' in cleaned.columns:
        cleaned['signup_date'] = pd.to_datetime(cleaned['signup_date'], errors='coerce')

    print(f"Data cleaning complete. Final shape: {cleaned.shape}")
    return cleaned

def create_age_groups(data: pd.DataFrame, age_col: str = 'age') -> pd.DataFrame:
    """Create standardized age groups"""
    result = data.copy()
    result['age_group'] = pd.cut(
        result[age_col],
        bins=[0, 25, 35, 50, 65, 100],
        labels=['18-25', '26-35', '36-50', '51-65', '65+']
    )
    return result

def plot_distribution(data: pd.DataFrame, column: str, title: str = None) -> None:
    """Create a standardized distribution plot"""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

    # Histogram
    data[column].hist(bins=30, ax=ax1, alpha=0.7)
    ax1.set_title(f'Distribution of {column}' if title is None else title)
    ax1.set_xlabel(column)
    ax1.set_ylabel('Frequency')

    # Box plot
    data.boxplot(column=column, ax=ax2)
    ax2.set_title(f'{column} Box Plot')

    plt.tight_layout()
    plt.show()

def calculate_customer_metrics(data: pd.DataFrame) -> Dict[str, Any]:
    """Calculate standard customer metrics"""
    metrics = {
        'total_customers': len(data),
        'avg_age': data['age'].mean(),
        'avg_spending': data['total_spent'].mean(),
        'spending_std': data['total_spent'].std(),
        'high_value_customers': len(data[data['total_spent'] >= 1000]),
        'age_groups': data['age_group'].value_counts().to_dict() if 'age_group' in data.columns else {}
    }
    return metrics

class CustomerAnalyzer:
    """Reusable customer analysis class"""

    def __init__(self, data: pd.DataFrame):
        self.data = data
        self.metrics = {}

    def run_basic_analysis(self) -> Dict[str, Any]:
        """Run comprehensive basic analysis"""
        self.metrics = calculate_customer_metrics(self.data)
        print("Customer Analysis Summary:")
        print(f"  Total Customers: {self.metrics['total_customers']:,}")
        print(f"  Average Age: {self.metrics['avg_age']:.1f}")
        print(f"  Average Spending: ${self.metrics['avg_spending']:,.2f}")
        print(f"  High-Value Customers: {self.metrics['high_value_customers']:,}")
        return self.metrics

    def plot_spending_by_age(self) -> None:
        """Create spending vs age visualization"""
        plt.figure(figsize=(10, 6))
        plt.scatter(self.data['age'], self.data['total_spent'], alpha=0.6)
        plt.xlabel('Age')
        plt.ylabel('Total Spent ($)')
        plt.title('Customer Spending by Age')
        plt.show()
Step 2: Use utilities in your notebooks
Notebook 1: Data Exploration
# Import our utilities
import sys
sys.path.append('.') # Add current directory to path
from data_utils import load_and_validate_data, clean_customer_data, create_age_groups, CustomerAnalyzer
# Configuration
REQUIRED_COLUMNS = ['customer_id', 'age', 'total_spent', 'signup_date']
DATA_FILE = 'customer_data.csv'
# Load and clean data using our utilities
data = load_and_validate_data(DATA_FILE, REQUIRED_COLUMNS)
if data is not None:
    cleaned_data = clean_customer_data(data)
    cleaned_data = create_age_groups(cleaned_data)

    # Run analysis
    analyzer = CustomerAnalyzer(cleaned_data)
    metrics = analyzer.run_basic_analysis()
    analyzer.plot_spending_by_age()
Notebook 2: Advanced Analysis
# Same utilities, different analysis focus
from data_utils import load_and_validate_data, clean_customer_data, plot_distribution
# Load data using the same reliable functions
data = load_and_validate_data('customer_data.csv', ['customer_id', 'age', 'total_spent'])
if data is not None:
    cleaned_data = clean_customer_data(data)

    # Focus on different visualizations
    plot_distribution(cleaned_data, 'total_spent', 'Customer Spending Distribution')
    plot_distribution(cleaned_data, 'age', 'Age Distribution')
Step 3: Create a package structure for larger projects
project/
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_customer_segmentation.ipynb
│   └── 03_predictive_modeling.ipynb
├── src/
│   ├── __init__.py
│   ├── data_processing/
│   │   ├── __init__.py
│   │   ├── cleaning.py
│   │   └── validation.py
│   ├── analysis/
│   │   ├── __init__.py
│   │   ├── customer_metrics.py
│   │   └── visualization.py
│   └── models/
│       ├── __init__.py
│       └── customer_segmentation.py
└── requirements.txt
Import in notebooks:
# Add src to path and import
import sys
sys.path.append('../src')
from data_processing.cleaning import clean_customer_data
from analysis.customer_metrics import CustomerAnalyzer
from analysis.visualization import plot_distribution
Benefits of this approach:
- Code reusability: Write once, use in multiple notebooks
- Easier maintenance: Fix bugs in one place
- Better testing: Can unit test utility functions separately (see the sketch below)
- Cleaner notebooks: Focus on analysis, not boilerplate code
- Team collaboration: Shared utilities across team members
- Version control: Track changes to utilities separately
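To illustrate the testing benefit, utilities that live in a module can be covered by ordinary unit tests. Here is a minimal sketch, assuming pytest is installed and the clean_customer_data function from the data_utils.py file above; the test file name and sample values are just illustrative:
# test_data_utils.py
import pandas as pd
from data_utils import clean_customer_data

def test_clean_customer_data_drops_invalid_rows():
    # One valid row plus rows with an impossible age or negative spending
    raw = pd.DataFrame({
        'age': [30, -5, 200, 45],
        'total_spent': [100.0, 50.0, 20.0, -10.0],
    })

    cleaned = clean_customer_data(raw)

    # Only the first row should survive the standard cleaning rules
    assert len(cleaned) == 1
    assert cleaned['age'].between(0, 120).all()
    assert (cleaned['total_spent'] >= 0).all()
Run it with pytest test_data_utils.py from the project directory.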
These practices, along with the previous ones, will help you create Python notebooks that are clean, efficient, and understandable, both to others and your future self. Clean code is a continuous practice, but the effort pays off in better productivity, easier debugging, and more maintainable code.
Happy coding!