Optimization & Performance Guide
MPNeuralNetwork is designed to be performant, but achieving optimal speed requires understanding how to configure the framework correctly.
1. Data Precision (Float32)
Rule #1: Use Float32.
By default, Python and NumPy use float64 (double precision). Deep Learning rarely benefits from this extra precision, yet it doubles memory use and roughly doubles the memory bandwidth needed per operation.
The framework enforces float32 internally (DTYPE = np.float32), but you should ensure your input data is cast before feeding it to the model to avoid on-the-fly conversion overhead.
import numpy as np

# Avoid: np.random.randn returns float64 by default
X_train = np.random.randn(1000, 784)
# Prefer: cast to float32 once, before training
X_train = np.random.randn(1000, 784).astype(np.float32)
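To make the cost concrete, the following minimal sketch compares the memory footprint of the same array in both precisions. It uses only NumPy; the array shape mirrors the example above and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# standard_normal returns float64 unless told otherwise
X64 = rng.standard_normal((1000, 784))
# Cast once, up front, so no conversion is needed at training time
X32 = X64.astype(np.float32)

mb64 = X64.nbytes / 1e6
mb32 = X32.nbytes / 1e6
print(f"float64: {mb64:.3f} MB, float32: {mb32:.3f} MB")

# copy=False skips the copy when the array is already float32
X32_again = X32.astype(np.float32, copy=False)
print("extra copy made:", not np.shares_memory(X32, X32_again))
```

Passing copy=False is a cheap way to make a cast idempotent in a pre-processing pipeline where the input dtype is not known in advance.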
2. Batch Size Selection
Choosing the right batch size is a trade-off between convergence stability and hardware utilization.
- Too Small (1-16): High overhead due to Python loops. The vectorization engine (BLAS/LAPACK) is starved of data.
- Too Large (2048+): Can lead to generalization issues and out-of-memory (OOM) errors.
- Optimal (32-512): Typically, powers of 2 like 32, 64, 128, or 256 provide the best balance.
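The trade-off above can be put into numbers with a rough sketch. It counts how many Python-level iterations one epoch needs and how large one float32 input batch is. The dataset size (60,000 samples) and the helper names are assumptions made for illustration; they are not part of the framework.

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of Python-level training iterations in one epoch."""
    return math.ceil(num_samples / batch_size)

def input_batch_megabytes(batch_size, num_features, itemsize=4):
    """Size of one input batch in MB (itemsize=4 corresponds to float32)."""
    return batch_size * num_features * itemsize / 1e6

NUM_SAMPLES = 60_000   # assumed dataset size
NUM_FEATURES = 784     # same feature count as the float32 example

for batch_size in (1, 16, 64, 256, 2048):
    steps = steps_per_epoch(NUM_SAMPLES, batch_size)
    mb = input_batch_megabytes(batch_size, NUM_FEATURES)
    print(f"batch_size={batch_size:>4}: {steps:>6} steps/epoch, "
          f"{mb:.3f} MB per input batch")
```

Small batches multiply the per-iteration Python overhead, while large batches multiply the memory held by each layer's activations, of which the input batch is only the first.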
3. Hardware Acceleration (GPU)
MPNN supports NVIDIA GPUs via CuPy. This provides massive speedups for large matrix multiplications (Dense layers) and Convolutions.
Prerequisites
- NVIDIA GPU
- CUDA Toolkit installed
- cupy Python package installed, matching your CUDA version (e.g., pip install cupy-cuda12x)
Note: Check your CUDA version with nvcc --version and install the corresponding package listed in the CuPy Installation Guide.
Enabling GPU Mode
Set the environment variable MPNN_BACKEND before running your script.
# Run on GPU
export MPNN_BACKEND=cupy
python my_script.py
Or inside Python (before importing mpneuralnetwork):
import os
os.environ["MPNN_BACKEND"] = "cupy"
import mpneuralnetwork
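If the same script has to run on machines with and without a GPU, the variable can be set conditionally. The sketch below is one possible approach, not part of the framework: select_backend is a hypothetical helper, and it assumes that leaving MPNN_BACKEND unset makes the framework use its default CPU backend. Note that find_spec only checks that the cupy package is installed, not that a working GPU and driver are present.

```python
import importlib.util
import os

def select_backend():
    """Hypothetical helper: request the CuPy backend only if cupy is installed."""
    if importlib.util.find_spec("cupy") is not None:
        os.environ["MPNN_BACKEND"] = "cupy"
        return "cupy"
    # Assumption: with the variable unset, the framework uses its default backend
    os.environ.pop("MPNN_BACKEND", None)
    return "default"

backend = select_backend()
print("Requested backend:", backend)

# Import the framework only after the variable has been set:
# import mpneuralnetwork
```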
4. Benchmarking
The project includes a robust benchmarking suite to measure performance improvements or regressions.
Running Benchmarks
Benchmarks are located in the benchmark/ directory. The runner script executes them and profiles both time and memory.
python benchmark/run_benchmarks.py
This will generate reports in output/benchmark_TIMESTAMP/:
- *.prof: CPU profile data.
- *.bin: Memory usage data (Memray).
- flamegraph.html: Interactive memory usage visualization.
Analyzing Results
Use snakeviz to visualize CPU bottlenecks:
snakeviz output/benchmark_.../cpu_profile.prof
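When a browser is not available, a profile can also be inspected as text with the standard library. The sketch below assumes the .prof files are standard cProfile dumps, which is the format snakeviz reads. To stay self-contained it profiles a toy function and writes its own file to a temporary directory instead of reading a real benchmark report.

```python
import cProfile
import os
import pstats
import tempfile

def toy_workload():
    """Stand-in for a benchmark; any Python callable can be profiled."""
    return sum(i * i for i in range(100_000))

# Write a profile to a temporary location (not the real benchmark output)
profile_path = os.path.join(tempfile.mkdtemp(), "cpu_profile.prof")
profiler = cProfile.Profile()
profiler.enable()
toy_workload()
profiler.disable()
profiler.dump_stats(profile_path)

# Load the dump and print the 10 entries with the highest cumulative time
stats = pstats.Stats(profile_path)
stats.strip_dirs().sort_stats("cumulative").print_stats(10)
```

Sorting by "cumulative" surfaces the call paths that dominate runtime; sorting by "tottime" instead highlights individual functions that are slow on their own.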
5. Common Bottlenecks
im2col Memory Usage
Convolutional layers use im2col to vectorize operations. This expands the input image into a large matrix.
- Impact: Memory usage grows by a factor of $K^2$ (kernel size squared).
- Mitigation:
  - Reduce the batch_size.
  - Use smaller kernels (e.g., 3x3 instead of 5x5).
  - Use Strides: Increasing stride (e.g., stride=2) drastically reduces the output spatial dimensions and the size of the intermediate im2col matrix, saving both memory and compute.
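The effect of each mitigation can be estimated with simple arithmetic. The sketch below assumes the common im2col layout with one row per output position and one column per (channel, kernel row, kernel column) entry; the framework's actual layout may differ, and both function names are hypothetical.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def im2col_bytes(batch, channels, height, width, kernel,
                 stride=1, padding=0, itemsize=4):
    """Estimated size in bytes of the im2col matrix (itemsize=4 is float32)."""
    out_h = conv_output_size(height, kernel, stride, padding)
    out_w = conv_output_size(width, kernel, stride, padding)
    rows = batch * out_h * out_w          # one row per output position
    cols = channels * kernel * kernel     # one column per patch element
    return rows * cols * itemsize

# Assumed example: batch of 64 RGB images, 32x32 pixels, float32
input_bytes = 64 * 3 * 32 * 32 * 4
k3 = im2col_bytes(64, 3, 32, 32, kernel=3, padding=1)
k5 = im2col_bytes(64, 3, 32, 32, kernel=5, padding=2)
k3_s2 = im2col_bytes(64, 3, 32, 32, kernel=3, stride=2, padding=1)

print(f"input:           {input_bytes / 1e6:.2f} MB")
print(f"3x3, stride 1:   {k3 / 1e6:.2f} MB ({k3 // input_bytes}x input)")
print(f"5x5, stride 1:   {k5 / 1e6:.2f} MB ({k5 // input_bytes}x input)")
print(f"3x3, stride 2:   {k3_s2 / 1e6:.2f} MB")
```

With "same" padding and stride 1, the matrix is exactly $K^2$ times the size of the input, while stride 2 cuts the row count by roughly a factor of four.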
Data Copying
The framework tries to minimize copies, but some operations (like flatten or transpose on non-contiguous arrays) force a copy.
- Tip: Ensure your data is C-contiguous if you are doing manual pre-processing: x = np.ascontiguousarray(x)
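The sketch below shows how to check contiguity with NumPy and where the hidden copy occurs. It uses only standard NumPy calls and a small illustrative array.

```python
import numpy as np

x = np.arange(12, dtype=np.float32).reshape(3, 4)
xt = x.T  # transpose is a view: same memory, different strides

print("x  C-contiguous:", x.flags["C_CONTIGUOUS"])
print("xt C-contiguous:", xt.flags["C_CONTIGUOUS"])

# Flattening a contiguous array is a view; flattening the transposed
# view forces NumPy to copy the data into a new buffer
flat_view = x.reshape(-1)
flat_copy = xt.reshape(-1)
print("flat_view shares memory with x: ", np.shares_memory(flat_view, x))
print("flat_copy shares memory with xt:", np.shares_memory(flat_copy, xt))

# Pay for the copy once, up front, instead of on every later operation
xc = np.ascontiguousarray(xt)
print("xc C-contiguous:", xc.flags["C_CONTIGUOUS"])
```

Calling np.ascontiguousarray on an array that is already C-contiguous with the right dtype does not copy, so it is safe to apply unconditionally at the end of a pre-processing step.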