
How to Write Clean Code in Python Notebooks

Essential practices for writing clean and efficient code in Python notebooks.

Diego Garcia • 5 min

Python notebooks are popular for both data analysis and data science. They're interactive, allowing you to write code, run it, and see the results in the same environment. But just like any code, your notebooks can become a mess if you're not careful. In this blog post, we'll cover some effective practices for writing clean, readable, and maintainable code in Python notebooks.

Use Markdown Cells for Documentation

Python notebooks are not just about code; they also let you add Markdown cells. These help you structure your notebook and explain your reasoning, making it more understandable to others (or to your future self).

Example: Poor Documentation with Comments

# Load the dataset and perform basic cleaning
# Remove rows with missing values and convert date column
import pandas as pd
data = pd.read_csv('sales_data.csv')
data = data.dropna()  # Remove missing values
data['date'] = pd.to_datetime(data['date'])  # Convert to datetime
print(f"Dataset shape: {data.shape}")

Example: Better Documentation with Markdown

Instead, use a Markdown cell above your code:

Markdown Cell:

## Data Loading and Cleaning

In this section, we load our sales dataset and perform initial cleaning:
- Remove rows with missing values to ensure data quality
- Convert the date column to datetime format for time series analysis
- Display basic information about the cleaned dataset

Code Cell:

import pandas as pd

# Load and clean the dataset
data = pd.read_csv('sales_data.csv')
data = data.dropna()
data['date'] = pd.to_datetime(data['date'])

print(f"Dataset shape: {data.shape}")

This approach provides clear separation between documentation and code, making your notebook more readable and professional.

Break Down Complex Code into Multiple Cells

Python notebooks allow you to execute chunks of code independently. This feature can be used to improve readability and debug code more effectively. Rather than writing all your code in a single cell, you can break it down into smaller pieces, each accomplishing a specific task.

Example: Complex Code in Single Cell (Poor Practice)

# BAD: Everything in one cell
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load and preprocess data
data = pd.read_csv('customer_data.csv')
data = data.dropna()
data['age_group'] = pd.cut(data['age'], bins=[0, 25, 45, 65, 100], labels=['Young', 'Adult', 'Middle', 'Senior'])
data = pd.get_dummies(data, columns=['age_group', 'gender'])
X = data.drop('target', axis=1)
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model and evaluate
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
print(classification_report(y_test, predictions))

# Create visualization
feature_importance = pd.DataFrame({'feature': X.columns, 'importance': model.feature_importances_})
feature_importance = feature_importance.sort_values('importance', ascending=False)
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'][:10], feature_importance['importance'][:10])
plt.title('Top 10 Feature Importances')
plt.show()

Example: Breaking Down into Multiple Cells (Better Practice)

Cell 1: Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

Cell 2: Load and Clean Data

# Load the dataset
data = pd.read_csv('customer_data.csv')
print(f"Original dataset shape: {data.shape}")

# Remove missing values
data = data.dropna()
print(f"After removing NaN: {data.shape}")

Cell 3: Feature Engineering

# Create age groups
data['age_group'] = pd.cut(data['age'], 
                          bins=[0, 25, 45, 65, 100], 
                          labels=['Young', 'Adult', 'Middle', 'Senior'])

# Convert categorical variables to dummy variables
data = pd.get_dummies(data, columns=['age_group', 'gender'])
print(f"Features after encoding: {data.columns.tolist()}")

Cell 4: Prepare Training Data

# Separate features and target
X = data.drop('target', axis=1)
y = data['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")

Cell 5: Train Model

# Initialize and train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Model training completed!")

Cell 6: Evaluate Model

# Make predictions and evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, predictions))

Cell 7: Visualize Feature Importance

# Create feature importance visualization
feature_importance = pd.DataFrame({
    'feature': X.columns, 
    'importance': model.feature_importances_
})
feature_importance = feature_importance.sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'][:10], feature_importance['importance'][:10])
plt.title('Top 10 Feature Importances')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()

Benefits of this approach:

  • Each cell has a single, clear purpose
  • Easy to debug individual steps
  • Can re-run specific parts without rerunning everything
  • Better readability and organization

Organize Your Notebook with Sections and Subsections

You can use Markdown cells to create sections and subsections, similar to how you would structure a standard report or document. This keeps your notebook organized and makes it easier for others to understand the flow of your analysis.

Example: Well-Organized Notebook Structure

Here's how you can structure a data analysis notebook using clear sections:

Markdown Cell - Main Title:

# Customer Churn Analysis Project

**Objective:** Analyze customer data to identify patterns and predict churn likelihood

**Dataset:** Customer transaction and demographic data (Jan 2023 - Dec 2023)

**Author:** Data Science Team  
**Date:** 2024-01-15

Markdown Cell - Table of Contents:

## Table of Contents

1. [Data Import and Setup](#data-import)
2. [Exploratory Data Analysis](#eda)
   - 2.1. Data Overview
   - 2.2. Missing Values Analysis
   - 2.3. Feature Distributions
3. [Data Preprocessing](#preprocessing)
   - 3.1. Feature Engineering
   - 3.2. Data Cleaning
4. [Model Development](#modeling)
   - 4.1. Baseline Model
   - 4.2. Feature Selection
   - 4.3. Model Tuning
5. [Results and Conclusions](#results)

Markdown Cell - Section Header:

# 1. Data Import and Setup {#data-import}

In this section, we import necessary libraries and load our dataset.

Markdown Cell - Subsection:

## 2.1. Data Overview {#data-overview}

Let's examine the basic structure and characteristics of our dataset:
- Dataset dimensions
- Column data types
- Basic statistics

Markdown Cell - Analysis Results:

### Key Findings from EDA

From our exploratory analysis, we discovered:

1. **Missing Data**: 12% of records have missing income information
2. **Class Imbalance**: Only 23% of customers churned
3. **Key Patterns**: 
   - Higher churn rate among customers with >3 support tickets
   - Customers with month-to-month contracts show 3x higher churn
   - Premium customers have significantly lower churn rates

**Next Steps**: Based on these findings, we'll focus on feature engineering around customer support interactions and contract types.

This organizational approach makes your notebook easier to navigate, easier to review, and far more useful when you return to it later.

Avoid Hard-Coding Values

Hard-coding values in your code can lead to mistakes and make it harder to maintain. Instead, assign important values to variables at the beginning of your notebook.

Example: Hard-coded Values (Poor Practice)

# BAD: Hard-coded values scattered throughout the notebook

# Cell 1
data = pd.read_csv('customers_2023.csv')

# Cell 5
filtered_data = data[data['age'] >= 25]

# Cell 12
plt.figure(figsize=(12, 8))
plt.title('Customer Analysis 2023')

# Cell 18  
train_test_split(X, y, test_size=0.2, random_state=42)

# Cell 25
model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

# Cell 30
high_value_customers = data[data['total_spent'] >= 1000]

Example: Using Configuration Variables (Better Practice)

# GOOD: Configuration section at the beginning
# =============================================================================
# CONFIGURATION PARAMETERS
# =============================================================================

# File paths
DATA_FILE = 'customers_2023.csv'
OUTPUT_DIR = 'results/'
MODEL_SAVE_PATH = 'models/customer_model.pkl'

# Analysis parameters
MIN_AGE = 25
MIN_SPENDING_THRESHOLD = 1000
ANALYSIS_YEAR = 2023

# Model parameters
TEST_SIZE = 0.2
RANDOM_STATE = 42
N_ESTIMATORS = 100
MAX_DEPTH = 10

# Visualization parameters
FIGURE_SIZE = (12, 8)
DPI = 300
COLOR_PALETTE = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']

# =============================================================================
# ANALYSIS CODE
# =============================================================================

# Cell 1: Load data
data = pd.read_csv(DATA_FILE)
print(f"Loaded data from {DATA_FILE}")

# Cell 2: Filter by age
filtered_data = data[data['age'] >= MIN_AGE]
print(f"Filtered to customers aged {MIN_AGE}+: {len(filtered_data)} records")

# Cell 3: Create visualization
plt.figure(figsize=FIGURE_SIZE, dpi=DPI)
plt.title(f'Customer Analysis {ANALYSIS_YEAR}')

# Cell 4: Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE
)

# Cell 5: Train model
model = RandomForestClassifier(
    n_estimators=N_ESTIMATORS, 
    max_depth=MAX_DEPTH, 
    random_state=RANDOM_STATE
)

# Cell 6: Identify high-value customers
high_value_customers = data[data['total_spent'] >= MIN_SPENDING_THRESHOLD]
print(f"High-value customers (${MIN_SPENDING_THRESHOLD}+): {len(high_value_customers)}")

Example: Using a Configuration Dictionary

# Alternative approach: Configuration dictionary
CONFIG = {
    'data': {
        'file_path': 'customers_2023.csv',
        'min_age': 25,
        'spending_threshold': 1000
    },
    'model': {
        'test_size': 0.2,
        'random_state': 42,
        'n_estimators': 100,
        'max_depth': 10
    },
    'visualization': {
        'figure_size': (12, 8),
        'color_palette': ['#1f77b4', '#ff7f0e', '#2ca02c']
    }
}

# Usage throughout the notebook
data = pd.read_csv(CONFIG['data']['file_path'])
filtered_data = data[data['age'] >= CONFIG['data']['min_age']]
model = RandomForestClassifier(
    n_estimators=CONFIG['model']['n_estimators'],
    max_depth=CONFIG['model']['max_depth'],
    random_state=CONFIG['model']['random_state']
)

Benefits of this approach:

  • Easy maintenance: Change values in one place
  • Better documentation: Clear parameter definitions
  • Reproducibility: Consistent parameters across runs
  • Flexibility: Easy to create different configurations for different scenarios

Include Error Handling

When your code encounters an error, it's helpful to know why. You can include error handling in your code to catch exceptions and provide helpful error messages.

Example: Basic Error Handling

import pandas as pd
import os

# Basic file loading with error handling
def load_data_safely(file_path):
    """Load data with proper error handling"""
    try:
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"Data file not found: {file_path}")
        
        data = pd.read_csv(file_path)
        print(f"✅ Successfully loaded {len(data)} records from {file_path}")
        return data
        
    except FileNotFoundError as e:
        print(f"❌ File Error: {e}")
        print("💡 Please check the file path and ensure the file exists.")
        return None
        
    except pd.errors.EmptyDataError:
        print(f"❌ Data Error: The file {file_path} is empty")
        return None
        
    except pd.errors.ParserError as e:
        print(f"❌ Parse Error: Could not parse {file_path}")
        print(f"   Details: {e}")
        return None
        
    except Exception as e:
        print(f"❌ Unexpected error loading {file_path}: {e}")
        return None

# Usage
data = load_data_safely('customer_data.csv')
if data is not None:
    print("Data loaded successfully, proceeding with analysis...")
else:
    print("Cannot proceed without data. Please fix the data loading issue.")

Example: Robust Data Processing with Error Handling

def process_customer_data(data):
    """Process customer data with comprehensive error handling"""
    
    if data is None or data.empty:
        raise ValueError("Cannot process empty or None data")
    
    try:
        # Check required columns
        required_columns = ['customer_id', 'age', 'total_spent', 'signup_date']
        missing_columns = [col for col in required_columns if col not in data.columns]
        
        if missing_columns:
            raise KeyError(f"Missing required columns: {missing_columns}")
        
        # Data type conversions with error handling
        processed_data = data.copy()
        
        # Convert age to numeric
        try:
            processed_data['age'] = pd.to_numeric(processed_data['age'], errors='coerce')
            invalid_ages = processed_data['age'].isna().sum()
            if invalid_ages > 0:
                print(f"⚠️  Warning: {invalid_ages} invalid age values converted to NaN")
        except Exception as e:
            print(f"❌ Error converting age column: {e}")
            
        # Convert date column
        try:
            processed_data['signup_date'] = pd.to_datetime(processed_data['signup_date'])
        except Exception as e:
            print(f"❌ Error converting signup_date: {e}")
            print("💡 Expected date format: YYYY-MM-DD")
            
        # Validate data ranges
        if (processed_data['age'] < 0).any():
            print("⚠️  Warning: Found negative age values")

        if (processed_data['total_spent'] < 0).any():
            print("⚠️  Warning: Found negative spending values")

        print("✅ Data processing completed successfully")
        return processed_data
        
    except KeyError as e:
        print(f"❌ Column Error: {e}")
        print(f"   Available columns: {list(data.columns)}")
        return None
        
    except Exception as e:
        print(f"❌ Processing Error: {e}")
        return None

# Usage with error handling
try:
    processed_data = process_customer_data(data)
    if processed_data is not None:
        print(f"Processing complete. Final dataset shape: {processed_data.shape}")
    else:
        print("Data processing failed. Check the errors above.")
        
except Exception as e:
    print(f"❌ Critical error in data processing: {e}")

Example: API Calls with Retry Logic

import requests
import time
from typing import Optional, Dict, Any

def fetch_data_with_retry(url: str, max_retries: int = 3, delay: float = 1.0) -> Optional[Dict[Any, Any]]:
    """
    Fetch data from API with retry logic and proper error handling
    """
    
    for attempt in range(max_retries):
        try:
            print(f"🔄 Attempt {attempt + 1}/{max_retries}: Fetching data from {url}")
            
            response = requests.get(url, timeout=30)
            response.raise_for_status()  # Raises an HTTPError for bad responses
            
            data = response.json()
            print(f"✅ Successfully fetched data (attempt {attempt + 1})")
            return data
            
        except requests.exceptions.Timeout:
            print(f"⏰ Timeout on attempt {attempt + 1}")
            
        except requests.exceptions.ConnectionError:
            print(f"🔌 Connection error on attempt {attempt + 1}")
            
        except requests.exceptions.HTTPError as e:
            if response.status_code == 429:  # Rate limited
                print(f"📈 Rate limited on attempt {attempt + 1}, waiting longer...")
                time.sleep(delay * 2)  # Wait longer for rate limits
            else:
                print(f"🌐 HTTP error {response.status_code} on attempt {attempt + 1}: {e}")
                
        except requests.exceptions.RequestException as e:
            print(f"📡 Request error on attempt {attempt + 1}: {e}")
            
        except ValueError as e:  # JSON decode error
            print(f"📄 JSON decode error on attempt {attempt + 1}: {e}")
            
        except Exception as e:
            print(f"❌ Unexpected error on attempt {attempt + 1}: {e}")
        
        # Wait before retrying (except on last attempt)
        if attempt < max_retries - 1:
            print(f"⏳ Waiting {delay} seconds before retry...")
            time.sleep(delay)
            delay *= 1.5  # Exponential backoff
    
    print(f"❌ Failed to fetch data after {max_retries} attempts")
    return None

# Usage
api_data = fetch_data_with_retry('https://api.example.com/customers')
if api_data:
    print("API data loaded successfully")
else:
    print("Using cached/backup data instead")

Benefits of proper error handling:

  • Better user experience: Clear, helpful error messages
  • Easier debugging: Specific information about what went wrong
  • Graceful degradation: Notebook continues running even when some operations fail
  • Production readiness: Robust code that handles edge cases

Regularly Restart the Kernel and Run All Cells

Over time, your notebook's state may become messy due to running cells out of order or modifying variables. To ensure that your code is reproducible and doesn't depend on any particular cell execution order, regularly restart the kernel and run all cells from the top. This helps you catch hidden bugs that might otherwise go unnoticed.
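
If you want to automate this check, one option is to execute the notebook top to bottom in a fresh kernel from a small script. The sketch below is a minimal example, assuming nbformat and nbconvert are installed; the notebook filename is made up. The same check can also be run from the command line with jupyter nbconvert --to notebook --execute.

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

NOTEBOOK_PATH = 'customer_analysis.ipynb'  # hypothetical notebook name

# Read the notebook and run every cell in order with a fresh kernel,
# mirroring what "Restart Kernel and Run All Cells" does interactively.
with open(NOTEBOOK_PATH) as f:
    nb = nbformat.read(f, as_version=4)

executor = ExecutePreprocessor(timeout=600, kernel_name='python3')
executor.preprocess(nb, {'metadata': {'path': '.'}})  # raises if any cell fails

print("All cells ran successfully from a clean state")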

Minimize the Output

While it's helpful to use print statements for debugging and understanding how your code works, it's a good practice to minimize the output when you're done. This doesn't mean you have to remove all print statements, but you should ensure that your code doesn't produce an overwhelming amount of output, which can make your notebook hard to navigate.

For example, if you're looping through a large list and printing the result for each iteration, consider removing the print statement or only printing for a subset of the iterations once you're done.
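
As a minimal sketch of this idea (items and process_item below are placeholders, not part of the earlier analysis), you can report progress every N iterations instead of printing on every pass through the loop:

# Placeholder data and per-item work for illustration
items = range(10_000)

def process_item(x):
    return x * 2

results = []
for i, item in enumerate(items):
    results.append(process_item(item))
    if i % 1_000 == 0:  # one progress line per 1,000 items instead of one per item
        print(f"Processed {i}/{len(items)} items")

print(f"Done: {len(results)} items processed")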

Share Code Across Notebooks

Often, you'll find that you have a function or a class that could be useful in multiple notebooks. Instead of duplicating the code, consider putting it in a separate Python file and importing it. This makes your code more modular and easier to maintain.

Example: Creating Reusable Utility Functions

Step 1: Create a utility file (data_utils.py)

# data_utils.py
"""
Data processing utilities for customer analysis notebooks
"""

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Any, Optional, Tuple

def load_and_validate_data(file_path: str, required_columns: List[str]) -> Optional[pd.DataFrame]:
    """
    Load CSV data and validate required columns exist
    
    Args:
        file_path: Path to the CSV file
        required_columns: List of column names that must be present
        
    Returns:
        DataFrame if successful, None if validation fails
    """
    try:
        data = pd.read_csv(file_path)
        
        missing_columns = [col for col in required_columns if col not in data.columns]
        if missing_columns:
            print(f"❌ Missing required columns: {missing_columns}")
            return None
            
        print(f"✅ Successfully loaded {len(data)} records with all required columns")
        return data
        
    except Exception as e:
        print(f"❌ Error loading data: {e}")
        return None

def clean_customer_data(data: pd.DataFrame) -> pd.DataFrame:
    """
    Standard cleaning pipeline for customer data
    
    Args:
        data: Raw customer DataFrame
        
    Returns:
        Cleaned DataFrame
    """
    cleaned = data.copy()
    
    # Remove duplicates
    initial_rows = len(cleaned)
    cleaned = cleaned.drop_duplicates()
    duplicates_removed = initial_rows - len(cleaned)
    if duplicates_removed > 0:
        print(f"🧹 Removed {duplicates_removed} duplicate records")
    
    # Clean age column
    cleaned['age'] = pd.to_numeric(cleaned['age'], errors='coerce')
    cleaned = cleaned[cleaned['age'].between(0, 120)]
    
    # Clean spending data
    cleaned['total_spent'] = pd.to_numeric(cleaned['total_spent'], errors='coerce')
    cleaned = cleaned[cleaned['total_spent'] >= 0]
    
    # Convert dates
    if 'signup_date' in cleaned.columns:
        cleaned['signup_date'] = pd.to_datetime(cleaned['signup_date'], errors='coerce')
    
    print(f"🧹 Data cleaning complete. Final shape: {cleaned.shape}")
    return cleaned

def create_age_groups(data: pd.DataFrame, age_col: str = 'age') -> pd.DataFrame:
    """Create standardized age groups"""
    result = data.copy()
    result['age_group'] = pd.cut(
        result[age_col], 
        bins=[0, 25, 35, 50, 65, 100], 
        labels=['18-25', '26-35', '36-50', '51-65', '65+']
    )
    return result

def plot_distribution(data: pd.DataFrame, column: str, title: str = None) -> None:
    """Create a standardized distribution plot"""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
    
    # Histogram
    data[column].hist(bins=30, ax=ax1, alpha=0.7)
    ax1.set_title(f'Distribution of {column}' if title is None else title)
    ax1.set_xlabel(column)
    ax1.set_ylabel('Frequency')
    
    # Box plot
    data.boxplot(column=column, ax=ax2)
    ax2.set_title(f'{column} Box Plot')
    
    plt.tight_layout()
    plt.show()

def calculate_customer_metrics(data: pd.DataFrame) -> Dict[str, Any]:
    """Calculate standard customer metrics"""
    metrics = {
        'total_customers': len(data),
        'avg_age': data['age'].mean(),
        'avg_spending': data['total_spent'].mean(),
        'spending_std': data['total_spent'].std(),
        'high_value_customers': len(data[data['total_spent'] >= 1000]),
        'age_groups': data['age_group'].value_counts().to_dict() if 'age_group' in data.columns else {}
    }
    return metrics

class CustomerAnalyzer:
    """Reusable customer analysis class"""
    
    def __init__(self, data: pd.DataFrame):
        self.data = data
        self.metrics = {}
        
    def run_basic_analysis(self) -> Dict[str, Any]:
        """Run comprehensive basic analysis"""
        self.metrics = calculate_customer_metrics(self.data)
        
        print("📊 Customer Analysis Summary:")
        print(f"   Total Customers: {self.metrics['total_customers']:,}")
        print(f"   Average Age: {self.metrics['avg_age']:.1f}")
        print(f"   Average Spending: ${self.metrics['avg_spending']:,.2f}")
        print(f"   High-Value Customers: {self.metrics['high_value_customers']:,}")
        
        return self.metrics
    
    def plot_spending_by_age(self) -> None:
        """Create spending vs age visualization"""
        plt.figure(figsize=(10, 6))
        plt.scatter(self.data['age'], self.data['total_spent'], alpha=0.6)
        plt.xlabel('Age')
        plt.ylabel('Total Spent ($)')
        plt.title('Customer Spending by Age')
        plt.show()

Step 2: Use utilities in your notebooks

Notebook 1: Data Exploration

# Import our utilities
import sys
sys.path.append('.')  # Add current directory to path
from data_utils import load_and_validate_data, clean_customer_data, create_age_groups, CustomerAnalyzer

# Configuration
REQUIRED_COLUMNS = ['customer_id', 'age', 'total_spent', 'signup_date']
DATA_FILE = 'customer_data.csv'

# Load and clean data using our utilities
data = load_and_validate_data(DATA_FILE, REQUIRED_COLUMNS)
if data is not None:
    cleaned_data = clean_customer_data(data)
    cleaned_data = create_age_groups(cleaned_data)
    
    # Run analysis
    analyzer = CustomerAnalyzer(cleaned_data)
    metrics = analyzer.run_basic_analysis()
    analyzer.plot_spending_by_age()

Notebook 2: Advanced Analysis

# Same utilities, different analysis focus
from data_utils import load_and_validate_data, clean_customer_data, plot_distribution

# Load data using the same reliable functions
data = load_and_validate_data('customer_data.csv', ['customer_id', 'age', 'total_spent'])
if data is not None:
    cleaned_data = clean_customer_data(data)
    
    # Focus on different visualizations
    plot_distribution(cleaned_data, 'total_spent', 'Customer Spending Distribution')
    plot_distribution(cleaned_data, 'age', 'Age Distribution')

Step 3: Create a package structure for larger projects

project/
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_customer_segmentation.ipynb
│   └── 03_predictive_modeling.ipynb
├── src/
│   ├── __init__.py
│   ├── data_processing/
│   │   ├── __init__.py
│   │   ├── cleaning.py
│   │   └── validation.py
│   ├── analysis/
│   │   ├── __init__.py
│   │   ├── customer_metrics.py
│   │   └── visualization.py
│   └── models/
│       ├── __init__.py
│       └── customer_segmentation.py
└── requirements.txt

Import in notebooks:

# Add src to path and import
import sys
sys.path.append('../src')

from data_processing.cleaning import clean_customer_data
from analysis.customer_metrics import CustomerAnalyzer
from analysis.visualization import plot_distribution

Benefits of this approach:

  • Code reusability: Write once, use in multiple notebooks
  • Easier maintenance: Fix bugs in one place
  • Better testing: Can unit test utility functions separately
  • Cleaner notebooks: Focus on analysis, not boilerplate code
  • Team collaboration: Shared utilities across team members
  • Version control: Track changes to utilities separately

These practices will help you create Python notebooks that are clean, efficient, and understandable, both to others and to your future self. Clean code is a continuous practice, but the effort pays off in better productivity, easier debugging, and more maintainable code.

Happy coding!