April 3, 2026 · deep-dive · 7 min read

Benchmarking the Rust extension: 8× faster than pure NumPy

When I rewrote the simulation core in Rust for the 0.3.0 release I had a vague expectation it would be "fast". What I didn't have was a systematic benchmark suite. This post fixes that — here are the numbers across four assembly sizes, comparing the original NumPy baseline, the Rust CPU extension, and the new GPU path from 0.4.0.

Test setup

Machine: Apple M3 Max, 64 GB RAM, macOS 15.2. Python 3.12.3, NumPy 2.0.1, PyTorch 2.3.0. Each measurement is the median of 20 runs after a 1-second warmup. Assembly sizes: 4, 12, 24, and 48 bodies. Each run simulates 10 seconds of model time at 1 kHz (dt=1e-3) with the RK4 integrator.
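For anyone reproducing the protocol outside the bench module, here is a minimal sketch of it in plain Python; run_case is a hypothetical stand-in for whatever drives one simulation, not part of molenspin's API.

    import statistics
    import time

    def measure(run_case, runs=20, warmup_seconds=1.0):
        """Median wall time of run_case() in milliseconds, after a warmup."""
        # Warm up for ~1 s so caches, lazy imports, and (for the GPU path)
        # kernel compilation stay out of the timed region.
        deadline = time.perf_counter() + warmup_seconds
        while time.perf_counter() < deadline:
            run_case()

        # Timed runs: the median is robust to the occasional outlier.
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            run_case()
            samples.append((time.perf_counter() - start) * 1e3)
        return statistics.median(samples)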

Results

Assembly size | NumPy (ms) | Rust CPU (ms) | Rust speedup | Rust+GPU (ms) | GPU speedup vs NumPy
4 bodies      | 41         | 8             | 5.1×         | 12            | 3.4×
12 bodies     | 148        | 19            | 7.8×         | 11            | 13.5×
24 bodies     | 491        | 63            | 7.8×         | 18            | 27.3×
48 bodies     | 1840       | 228           | 8.1×         | 83            | 22.2×

A few things stand out. The GPU path is actually slower for 4-body assemblies — the dispatch overhead dominates at that scale. Break-even is around 10 bodies. At 48 bodies the GPU is a 2.7× win over the Rust CPU path. The Rust speedup versus NumPy holds steady at 7.8–8.1× from 12 bodies up (only the tiny 4-body case falls to 5.1×), which suggests the bottleneck is arithmetic throughput rather than memory latency or Python overhead.

Where does the time go?

I profiled the Rust extension with cargo flamegraph on the 24-body case.

The force evaluation is embarrassingly parallel — each body's derivative depends only on its neighbors in the constraint graph. I'm currently computing it serially. Adding Rayon-based parallelism to the RK4 stage evaluation is on the roadmap for 0.5; preliminary numbers suggest a 2–3× improvement on 8-core machines for assemblies of 20+ bodies.
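To make the "embarrassingly parallel" point concrete, here is the shape of that loop sketched in plain Python rather than Rust; the adjacency lists and the pairwise coupling term are illustrative only, not molenspin's actual force law.

    import numpy as np

    def forces(positions, neighbors, k=1.0):
        """Accumulate per-body forces over the constraint graph.

        positions: (n_bodies, dof) float array; neighbors: list of index lists.
        Iteration i reads only body i and its neighbors and writes only row i,
        so the outer loop has no cross-iteration dependencies and could be
        split across threads (Rayon, on the Rust side).
        """
        out = np.zeros_like(positions)
        for i, nbrs in enumerate(neighbors):
            for j in nbrs:
                # Illustrative spring-like coupling between neighbors.
                out[i] += -k * (positions[i] - positions[j])
        return out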

NumPy isn't slow here for the reason you might think

The NumPy baseline isn't slow because of Python dispatch overhead per timestep — it's slow because the assembly-wide state vector is sparse and NumPy operates on dense arrays. At each RK4 stage, the force evaluation function touches maybe 6–8 non-zero entries in a 96-element Jacobian (for 48 bodies × 2 DOF). NumPy computes all 96 elements anyway. The Rust extension uses a sparse representation internally and skips the zeros. That's almost the entire 8× gap.
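A toy dense-versus-sparse comparison makes that gap plausible. The matrix below, its ~7 non-zeros per row, and the random values are illustrative stand-ins, not molenspin's actual state layout; the point is only that the dense path multiplies every entry while the sparse path touches nothing but the stored triplets.

    import numpy as np

    n = 96                                   # 48 bodies x 2 DOF
    rng = np.random.default_rng(0)

    # Mostly-zero coupling matrix: ~7 non-zero entries per row.
    J = np.zeros((n, n))
    for i in range(n):
        cols = rng.choice(n, size=7, replace=False)
        J[i, cols] = rng.standard_normal(7)
    x = rng.standard_normal(n)

    # Dense path (the NumPy baseline): multiplies every entry, zeros included.
    f_dense = J @ x

    # Sparse path (what the Rust core does, conceptually): keep only
    # (row, col, value) triplets and skip everything else.
    rows, cols = np.nonzero(J)
    vals = J[rows, cols]
    f_sparse = np.zeros(n)
    np.add.at(f_sparse, rows, vals * x[cols])

    assert np.allclose(f_dense, f_sparse)    # same result, ~14x fewer multiplies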

💡 Tip: if you're hitting performance limits and can't use the GPU path, try switching to integrator="verlet" for steady-state simulations. Verlet is cheaper per step and often allows a 2× larger timestep for energy-conserving systems.
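For context on the "cheaper per step" part: a velocity-Verlet step needs a single new force evaluation, versus four derivative evaluations per RK4 step. A generic sketch, not molenspin's integrator code:

    def verlet_step(x, v, a, accel, dt):
        """One velocity-Verlet step; accel(x) returns accelerations."""
        x_new = x + v * dt + 0.5 * a * dt**2
        a_new = accel(x_new)                 # the only new force evaluation
        v_new = v + 0.5 * (a + a_new) * dt
        return x_new, v_new, a_new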

Running the benchmarks yourself

pip install "molenspin[dev]"
python -m molenspin.bench --sizes 4,12,24,48 --duration 10 --dt 1e-3

Results are written to bench_results.json. PRs with benchmark results from other hardware are very welcome.
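If you want to eyeball a run before opening a PR, the file is plain JSON, so something like this works; the snippet assumes nothing about the schema beyond that.

    import json

    with open("bench_results.json") as f:
        print(json.dumps(json.load(f), indent=2))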