May 4, 2023
Notebooks
Diego García

Top strategies for optimizing Python notebooks

We discuss the challenges of optimizing Python notebooks and strategies to address slow performance and high memory usage, which can hinder productivity when working with large datasets or complex computations.

Introduction

Python notebooks have gained immense popularity among data scientists, researchers, and programmers for their interactive, user-friendly, and collaborative nature. However, they can become sluggish when the tasks they run are computationally heavy or memory-intensive. In this blog post, we'll explore some of the most effective strategies to speed up your Python notebooks and enhance your overall experience.

Embrace Vectorized Operations

First, prefer vectorized operations over explicit Python loops when working with large datasets. A vectorized operation applies a single operation to many elements of an array or collection at once, rather than iterating through each element individually. This approach enables faster computations by taking advantage of low-level optimizations, hardware capabilities such as SIMD instructions, and parallelism.

Libraries like NumPy and Pandas provide vectorized operations that reduce the number of iterations and enable faster computations by exploiting low-level optimizations. However, it's essential to understand the trade-offs when using vectorization, such as increased memory usage, which can be an issue for memory-constrained systems.

import numpy as np

# Instead of using loops
result = [x*2 for x in range(1000000)]

# Use vectorized operations
result = np.arange(1000000) * 2
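
The same principle applies to Pandas: operating on whole columns at once is usually much faster than applying a Python function row by row. A minimal sketch (the DataFrame and column names are made up for illustration):

import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [1, 2, 3]})

# Row-by-row apply: runs Python code for every row (slower)
df["total"] = df.apply(lambda row: row["price"] * row["quantity"], axis=1)

# Vectorized column arithmetic: executes in optimized C code (faster)
df["total"] = df["price"] * df["quantity"]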

Choose Efficient Data Structures

Choosing the right data structures can have a significant impact on your notebook's performance. For instance, using Python's built-in set and dict types can improve the speed of membership tests and lookups, compared to using lists. However, keep in mind that these data structures might have slightly higher memory overheads.

# Using a list (slower)
my_list = ['apple', 'banana', 'orange']
if 'banana' in my_list:
	print("Found!")

# Using a set (faster)
my_set = {'apple', 'banana', 'orange'}

if 'banana' in my_set:
	print("Found!")

Understanding how different data structures work under the hood is crucial to writing optimized code.
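
You can measure the difference yourself with %timeit: a membership test on a list scans elements one by one (O(n)), while a set uses hashing (O(1) on average). A rough sketch:

# List membership: linear scan through up to one million elements
items_list = list(range(1_000_000))
%timeit 999_999 in items_list

# Set membership: a single hash lookup, typically orders of magnitude faster
items_set = set(items_list)
%timeit 999_999 in items_set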

Optimize Memory Usage

Working with large datasets can consume a lot of memory, slowing down your notebook. To counter this, use memory-efficient data structures and libraries like Dask, which lets you work with larger-than-memory datasets by breaking them into smaller, manageable chunks. Keep in mind that Dask adds scheduling overhead and its performance depends heavily on available CPU cores, memory, and disk throughput, so it pays off mainly for datasets that genuinely don't fit in memory.

import dask.dataframe as dd

# Load a large CSV file using Dask (memory efficient)
df = dd.read_csv('large_file.csv')
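
Dask is lazy: operations build a task graph, and the data is only read, chunk by chunk, when you call .compute(). A brief sketch, assuming the file has hypothetical group and value columns:

# Nothing is loaded yet; the mean is computed chunk by chunk on .compute()
mean_per_group = df.groupby("group")["value"].mean().compute()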

Another strategy that works very well is specifying data types when reading DataFrames. By explicitly defining each column's type, you can reduce memory consumption and help your data processing operations run efficiently.

In Pandas, you can set up data types when reading a DataFrame from a file (e.g., CSV) by using the dtype parameter in the read_csv function. The dtype parameter accepts a dictionary that maps column names to their respective data types.

Here's an example of how to set up data types when reading a CSV file using Pandas:


import pandas as pd

# Define a dictionary with the desired data types for each column
column_types = {
  'column1': 'int32',
  'column2': 'float32',
  'column3': 'category'
}

# Read the CSV file, specifying the data types for each column
df = pd.read_csv('data.csv', dtype=column_types)


In this example, we define a dictionary column_types that maps the column names to their respective data types. We then pass this dictionary to the dtype parameter of the read_csv function, which reads the data and assigns the specified data types to each column in the resulting DataFrame.

By setting up data types when reading DataFrames, you can optimize memory usage and improve the performance of your data processing tasks in Python notebooks.
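
You can verify the saving directly with DataFrame.memory_usage. A quick sketch, reusing the hypothetical data.csv and column_types from above:

# Compare memory footprints with and without explicit dtypes
df_default = pd.read_csv('data.csv')
df_typed = pd.read_csv('data.csv', dtype=column_types)

print(df_default.memory_usage(deep=True).sum())  # default int64/float64/object
print(df_typed.memory_usage(deep=True).sum())    # smaller with int32/float32/category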

Leverage Parallelization

Take advantage of multiple cores on your machine by parallelizing your code with libraries like concurrent.futures or multiprocessing. For CPU-bound tasks in CPython, use process-based pools (ProcessPoolExecutor or multiprocessing), since the Global Interpreter Lock prevents threads from running Python bytecode in parallel; thread pools still help for I/O-bound work. Keep in mind that parallelization introduces complexity and might not be suitable for all problems.

from concurrent.futures import ProcessPoolExecutor

def square(x):
    return x * x

# Separate processes sidestep the GIL, so CPU-bound work can use all cores;
# chunksize keeps inter-process communication overhead manageable. On platforms
# that spawn rather than fork workers, the mapped function must be importable.
with ProcessPoolExecutor() as executor:
    results = list(executor.map(square, range(100000), chunksize=1000))

Consider Polars as an alternative to Pandas

Polars is a relatively new DataFrame library that can offer better performance than Pandas in certain scenarios. It is designed to be faster and more memory-efficient, making it an attractive alternative for large-scale data processing tasks. However, as a newer library, Polars may not have the same level of community support and extensive documentation that Pandas offers, so be prepared to invest some time in learning and adapting to this library.

import polars as pl

# Load a CSV file using Polars (potentially faster and more memory-efficient)
df = pl.read_csv("large_file.csv")

# Perform operations similar to Pandas
filtered_df = df.filter(pl.col("column_name") > 100)

result = filtered_df.group_by("group_column").agg(pl.col("value_column").mean())

By considering Polars as an alternative to Pandas, you can potentially speed up your Python notebooks, especially for large-scale data processing tasks.
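
Polars also provides a lazy API (pl.scan_csv), which builds and optimizes a query plan before reading any data and can further reduce runtime and memory usage. A brief sketch with the same hypothetical column names:

# The query is optimized (e.g. predicate pushdown) before any data is read
result = (
    pl.scan_csv("large_file.csv")
    .filter(pl.col("column_name") > 100)
    .group_by("group_column")
    .agg(pl.col("value_column").mean())
    .collect()
)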

Profile Your Code

Identify performance bottlenecks using IPython's built-in profiling magics such as %timeit and %prun. This will help you focus your optimization efforts on the most time-consuming parts of your code.

  • %timeit: The %timeit magic command is used to measure the execution time of a single statement or expression. It runs the given code multiple times, calculates the average execution time, and provides the results in a human-readable format. %timeit is helpful for quickly assessing the performance of different solutions or algorithms, allowing you to compare their efficiency.
  • %prun: The %prun magic command is used to profile your Python code by collecting detailed statistics about the execution time and the number of calls for each function. This information helps you identify performance bottlenecks, enabling you to focus your optimization efforts on the most time-consuming parts of your code.

    %prun provides a comprehensive report that includes the total execution time, the time spent on each function call, and the number of times each function was called. This command is particularly useful for analyzing complex code with multiple function calls.

import numpy as np

def slow_function():
    arr = np.random.rand(10000)
    return np.sum(arr)
  
%timeit slow_function()
# 129 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%prun slow_function()
# Prints a report listing, for each function called: the number of calls (ncalls),
# the time spent in the function itself (tottime), and the cumulative time
# including sub-calls (cumtime), ordered by internal time so the most expensive
# calls appear first.

Pre-generate Docker Images to Avoid Repeated Package Installation

Installing Python packages can be time-consuming, particularly when working with large and complex libraries like TensorFlow or PyTorch. Pre-generating Docker images can help you circumvent this issue by bundling all the necessary packages and dependencies into a single, reusable container. By doing so, you avoid installing packages each time you run your notebook, thus speeding up the execution process.

To create a custom Docker image, follow these steps:

1. Create a Dockerfile in your project directory.
# Set the base image
FROM python:3.9

# Install necessary libraries and packages
RUN pip install numpy pandas matplotlib jupyter

# Set the working directory
WORKDIR /app

# Copy your notebook files
COPY . /app

# Expose the Jupyter Notebook port
EXPOSE 8888

# Start the Jupyter Notebook server
CMD ["jupyter", "notebook", "--ip='*'", "--port=8888", "--no-browser", "--allow-root"]
2. Build the Docker image.
docker build -t my_notebook_image .
3. Run your Python notebook using the custom Docker image.
docker run -it -p 8888:8888 my_notebook_image

By using pre-generated Docker images, you can significantly reduce the time spent on package installation and create a consistent environment for your Python notebooks. This approach ensures that your notebooks run smoothly and quickly, allowing you to focus on your work without worrying about package management or dependency issues.


Conclusion

In conclusion, optimizing Python notebooks is crucial for enhancing productivity, reducing execution time, and improving the overall user experience.

Each of these strategies addresses different aspects of notebook performance, such as computational speed, memory efficiency, and hardware utilization. By applying these techniques, you can significantly improve your Python notebook's performance and tackle the challenges associated with large datasets and complex computations.

It's important to remember that no single strategy will be a silver bullet for all scenarios. Each situation requires a critical evaluation of the specific challenges and trade-offs, and a tailored approach to optimization. By understanding the underlying principles and techniques presented in this blog post, you'll be well-equipped to make informed decisions and optimize your Python notebooks for a more streamlined and efficient workflow.

Happy coding!
