Optimization & Performance Guide

MPNeuralNetwork is designed to be performant, but achieving optimal speed requires understanding how to configure the framework correctly.

1. Data Precision (Float32)

Rule #1: Use Float32.

By default, Python and NumPy use float64 (double precision). Deep learning rarely benefits from the extra precision, yet float64 doubles memory use and roughly doubles the memory bandwidth needed.

The framework enforces float32 internally (DTYPE = np.float32), but you should ensure your input data is cast before feeding it to the model to avoid on-the-fly conversion overhead.

import numpy as np

# Default: float64 (double precision)
X_train = np.random.randn(1000, 784)

# Explicit float32
X_train = np.random.randn(1000, 784).astype(np.float32)
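To make the memory cost concrete, here is a minimal sketch that compares the two dtypes on the same array shape. The helper as_float32 is a hypothetical convenience for illustration, not part of the framework; it relies only on standard NumPy calls.

```python
import numpy as np

def as_float32(x):
    """Return x as float32; no copy is made if x is already a float32 array."""
    return np.asarray(x, dtype=np.float32)

rng = np.random.default_rng(0)

x64 = rng.standard_normal((1000, 784))                     # float64 by default
x32 = rng.standard_normal((1000, 784), dtype=np.float32)   # sampled directly as float32

print(x64.dtype, x64.nbytes)  # float64 6272000
print(x32.dtype, x32.nbytes)  # float32 3136000

# Casting once up front means later calls see float32 and have nothing to convert.
x_cast = as_float32(x64)                 # one explicit conversion (copy)
x_same = as_float32(x32)                 # already float32: same memory, no copy
print(np.shares_memory(x_same, x32))     # True
```

Sampling directly in float32 also avoids the temporary float64 array that .astype(np.float32) would otherwise create.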

2. Batch Size Selection

Choosing the right batch size is a trade-off between convergence stability and hardware utilization.

Too Small (1-16): Per-batch Python loop overhead dominates, and the vectorized BLAS routines are starved of data.
Too Large (2048+): Can hurt generalization and cause out-of-memory (OOM) errors.
Optimal (32-512): Typically, powers of 2 like 32, 64, 128, or 256 provide the best balance.
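The Python-loop overhead mentioned above scales with the number of batches per epoch. The sketch below counts those iterations for a few batch sizes; minibatch_slices is a hypothetical helper written for this illustration and is not the framework's own batching code.

```python
import math

def minibatch_slices(n_samples, batch_size):
    """Yield slice objects that cover range(n_samples) in chunks of batch_size."""
    for start in range(0, n_samples, batch_size):
        yield slice(start, min(start + batch_size, n_samples))

n_samples = 1000
steps_per_epoch = {}
for batch_size in (1, 32, 128, 2048):
    steps = sum(1 for _ in minibatch_slices(n_samples, batch_size))
    assert steps == math.ceil(n_samples / batch_size)
    steps_per_epoch[batch_size] = steps
    print(f"batch_size={batch_size:>4}: {steps} Python-level iterations per epoch")
```

With 1000 samples, a batch size of 1 means 1000 trips through the Python loop per epoch, while 128 needs only 8, each one handing a much larger matrix to the vectorized backend.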

3. Hardware Acceleration (GPU)

MPNN supports NVIDIA GPUs via CuPy. This provides massive speedups for large matrix multiplications (Dense layers) and Convolutions.

Prerequisites

NVIDIA GPU
CUDA Toolkit installed
The cupy Python package, in the build that matches your CUDA version (e.g., pip install cupy-cuda12x)

Note: Check your CUDA version with nvcc --version and install the corresponding package listed in the CuPy Installation Guide.

Enabling GPU Mode

Set the environment variable MPNN_BACKEND before running your script.

# Run on GPU
export MPNN_BACKEND=cupy
python my_script.py

Or inside Python (before importing mpneuralnetwork):

import os
os.environ["MPNN_BACKEND"] = "cupy"
import mpneuralnetwork
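If the same script has to run on machines with and without CuPy, the variable can be set conditionally. This is a sketch under two assumptions: that leaving MPNN_BACKEND unset keeps the framework on its default backend, and that a package lookup is a good enough first check. The helper name enable_gpu_if_available is hypothetical.

```python
import importlib.util
import os

def enable_gpu_if_available():
    """Set MPNN_BACKEND=cupy when the cupy package is installed.

    Returns True if the variable was set. Finding the package does not
    guarantee a working GPU or driver; it only rules out an obvious
    ImportError.
    """
    if importlib.util.find_spec("cupy") is None:
        return False
    os.environ["MPNN_BACKEND"] = "cupy"
    return True

using_gpu = enable_gpu_if_available()
print("MPNN_BACKEND =", os.environ.get("MPNN_BACKEND", "<unset>"))

# Import the framework only after the variable is in place:
# import mpneuralnetwork
```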

4. Benchmarking

The project includes a benchmarking suite for measuring performance improvements and catching regressions.

Running Benchmarks

Benchmarks are located in the benchmark/ directory. The runner script executes them and profiles both time and memory.

python benchmark/run_benchmarks.py

This will generate reports in output/benchmark_TIMESTAMP/:

*.prof: CPU profile data.
*.bin: Memory usage data (Memray).
flamegraph.html: Interactive memory usage visualization.

Analyzing Results

Use snakeviz to visualize CPU bottlenecks:

snakeviz output/benchmark_.../cpu_profile.prof
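For a quick look without a browser, a profile can also be read with the standard library's pstats module. The sketch below assumes the .prof files are in the cProfile/pstats format that snakeviz reads. To stay self-contained it first writes a small stand-in profile to a temporary directory; busy_work and the file path are placeholders, not part of the benchmark suite.

```python
import cProfile
import os
import pstats
import tempfile

def busy_work(n=20000):
    """Placeholder workload so that there is something to profile."""
    return sum(i * i for i in range(n))

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a .prof file produced by the benchmark runner.
    prof_path = os.path.join(tmp, "cpu_profile.prof")

    profiler = cProfile.Profile()
    profiler.enable()
    busy_work()
    profiler.disable()
    profiler.dump_stats(prof_path)

    # Load the profile and print the 5 entries with the highest cumulative time.
    stats = pstats.Stats(prof_path)
    stats.sort_stats("cumulative").print_stats(5)
```

To inspect a real benchmark run, point pstats.Stats at the generated .prof file instead of the temporary one.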

5. Common Bottlenecks

`im2col` Memory Usage

Convolutional layers use im2col to vectorize operations. This expands the input image into a large matrix.

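Impact: Memory usage grows by a factor of roughly $K^2$ (kernel size squared) relative to the input, since at stride 1 each input pixel is duplicated into up to $K^2$ columns.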
Mitigation:
- Reduce the batch_size.
- Use smaller kernels (e.g., 3x3 instead of 5x5).
- Use strides: Increasing the stride (e.g., stride=2) shrinks the output spatial dimensions, and with them the intermediate im2col matrix, by roughly a factor of stride squared, saving both memory and compute.
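The effect of each mitigation can be estimated before running anything. The sketch below uses the standard convolution output-size formula and assumes an im2col matrix with one row per output position and one column per (channel, kernel row, kernel column) entry, stored as float32. The framework's actual layout may differ, but the element count would be the same. Both function names are hypothetical.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def im2col_bytes(batch, channels, height, width, kernel,
                 stride=1, padding=0, itemsize=4):
    """Estimated size in bytes of the intermediate im2col matrix."""
    out_h = conv_output_size(height, kernel, stride, padding)
    out_w = conv_output_size(width, kernel, stride, padding)
    rows = batch * out_h * out_w          # one row per output position
    cols = channels * kernel * kernel     # one column per patch element
    return rows * cols * itemsize

# A batch of 64 RGB images of 32x32 pixels in float32.
input_bytes = 64 * 3 * 32 * 32 * 4
print("input:", input_bytes)              # 786432

for kernel, stride in [(3, 1), (5, 1), (3, 2)]:
    size = im2col_bytes(64, 3, 32, 32, kernel, stride)
    print(f"kernel={kernel} stride={stride}: {size} bytes "
          f"({size / input_bytes:.1f}x the input)")
```

For this input, the 3x3 kernel at stride 1 needs about 6.2 MB, the 5x5 kernel about 15.1 MB, and the 3x3 kernel at stride 2 about 1.6 MB, a quarter of the stride-1 figure.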

Data Copying

The framework tries to minimize copies, but some operations force one. In NumPy, flatten always returns a copy, and reshaping a transposed (non-contiguous) array cannot return a view.

Tip: Ensure your data is C-contiguous if you are doing manual pre-processing: x = np.ascontiguousarray(x).
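The following sketch shows, with plain NumPy, when a copy happens and how np.ascontiguousarray helps. It illustrates NumPy's general behavior only and makes no claim about where the framework itself copies data.

```python
import numpy as np

x = np.arange(12, dtype=np.float32).reshape(3, 4)

# Transposing is free: it returns a view with swapped strides...
xt = x.T
print(xt.flags["C_CONTIGUOUS"])        # False
print(np.shares_memory(x, xt))         # True

# ...but flattening that non-contiguous view has to copy the data.
flat_copy = xt.reshape(-1)
print(np.shares_memory(xt, flat_copy)) # False

# Paying for the copy once up front makes later reshapes cheap views.
xc = np.ascontiguousarray(xt)
flat_view = xc.reshape(-1)
print(xc.flags["C_CONTIGUOUS"])        # True
print(np.shares_memory(xc, flat_view)) # True

# On an array that is already C-contiguous, no new buffer is allocated.
print(np.shares_memory(np.ascontiguousarray(x), x))  # True
```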