Benchmarking the Rust extension: 8× faster than pure NumPy
When I rewrote the simulation core in Rust for the 0.3.0 release I had a vague expectation it would be "fast". What I didn't have was a systematic benchmark suite. This post fixes that — here are the numbers across four assembly sizes, comparing the original NumPy baseline, the Rust CPU extension, and the new GPU path from 0.4.0.
Test setup
Machine: Apple M3 Max, 64 GB RAM, macOS 15.2. Python 3.12.3, NumPy 2.0.1, PyTorch 2.3.0. Each measurement is the median of 20 runs with 1-second warmup. Assembly sizes: 4, 12, 24, and 48 bodies. Each run simulates 10 seconds at 1 kHz (dt=1e-3), RK4 integrator.
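For reference, here is a minimal sketch of that timing protocol. The actual bench module may be structured differently; `run_sim` and `measure` are just stand-in names for whatever callable you're measuring.

```python
import statistics
import time

def measure(run_sim, warmup_s=1.0, repeats=20):
    """Median-of-N timing with a fixed warmup, matching the protocol above."""
    # Warm up for ~1 second so caches, allocator pools, and (for the GPU path)
    # compiled kernels are hot before we start recording.
    deadline = time.perf_counter() + warmup_s
    while time.perf_counter() < deadline:
        run_sim()
    # Record `repeats` timed runs and report the median.
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        run_sim()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)
```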
Results
| Assembly size | NumPy (ms) | Rust CPU (ms) | CPU speedup vs NumPy | Rust+GPU (ms) | GPU speedup vs NumPy |
|---|---|---|---|---|---|
| 4 bodies | 41 | 8 | 5.1× | 12 | 3.4× |
| 12 bodies | 148 | 19 | 7.8× | 11 | 13.5× |
| 24 bodies | 491 | 63 | 7.8× | 18 | 27.3× |
| 48 bodies | 1840 | 228 | 8.1× | 83 | 22.2× |
A few things stand out. The GPU path is actually slower than Rust CPU for 4-body assemblies, because kernel dispatch overhead dominates at that scale; break-even with the CPU path is somewhere around 10 bodies. At 48 bodies the GPU is a 2.7× win over Rust CPU. The Rust speedup over NumPy holds at roughly 8× for every assembly of 12 bodies or more (the 4-body case is 5.1×, small enough that fixed per-call overhead still matters), which suggests the bottleneck is arithmetic throughput rather than memory latency or Python overhead.
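If you want to sanity-check the speedup columns, they follow directly from the raw medians in the table:

```python
# Raw medians from the table above, in ms: {bodies: (numpy, rust_cpu, rust_gpu)}
timings = {4: (41, 8, 12), 12: (148, 19, 11), 24: (491, 63, 18), 48: (1840, 228, 83)}

for n, (np_ms, cpu_ms, gpu_ms) in timings.items():
    print(f"{n:>2} bodies: CPU {np_ms / cpu_ms:.1f}x vs NumPy, "
          f"GPU {np_ms / gpu_ms:.1f}x vs NumPy, "
          f"GPU {cpu_ms / gpu_ms:.1f}x vs Rust CPU")
```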
Where does the time go?
I profiled the Rust extension with `cargo flamegraph` on the 24-body case. The breakdown:
- 58% — RK4 stage evaluation (4 force evaluations per step × 10,000 steps)
- 22% — constraint residual + Jacobian assembly
- 11% — Newton solver (inner iteration to enforce constraints at each RK4 stage)
- 9% — history buffer writes
The force evaluation is embarrassingly parallel — each body's derivative depends only on its neighbors in the constraint graph. I'm currently computing it serially. Adding Rayon-based parallelism to the RK4 stage evaluation is on the roadmap for 0.5; preliminary numbers suggest a 2–3× improvement on 8-core machines for assemblies of 20+ bodies.
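To make the breakdown concrete, here is a rough sketch of the step structure those percentages refer to. `force` and `project` are hypothetical stand-ins for the real force evaluation and the Newton-based constraint projection; the actual Rust code differs in detail.

```python
def rk4_step(state, dt, force, project):
    """One constrained RK4 step: four force evaluations (the 58% slice),
    each on a stage state pulled back onto the constraints by a short
    Newton iteration inside `project` (the 11% slice)."""
    k1 = force(project(state))
    k2 = force(project(state + 0.5 * dt * k1))
    k3 = force(project(state + 0.5 * dt * k2))
    k4 = force(project(state + dt * k3))
    return project(state + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4))
```

The stages themselves are sequential (k2 depends on k1), so the planned Rayon parallelism applies across bodies within each force call, not across stages.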
NumPy isn't slow here for the reason you might think
The NumPy baseline isn't slow because of Python dispatch overhead per timestep; it's slow because the assembly-wide coupling is sparse and NumPy operates on dense arrays. At each RK4 stage, the force evaluation touches maybe 6–8 non-zero entries per row of the 96-column Jacobian (48 bodies × 2 DOF). NumPy computes all 96 entries per row anyway. The Rust extension uses a sparse representation internally and skips the zeros. That accounts for almost the entire 8× gap.
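A toy illustration of that gap, using made-up coupling values rather than anything from molenspin: a dense matvec touches every entry, while an index-based evaluation touches only the structurally non-zero couplings.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 96                                    # 48 bodies x 2 DOF
J = np.zeros((n, n))
for i in range(n):                        # ~7 non-zero couplings per row
    cols = rng.choice(n, size=7, replace=False)
    J[i, cols] = rng.standard_normal(7)
x = rng.standard_normal(n)

dense = J @ x                             # dense path: visits all n*n entries

rows, cols = np.nonzero(J)                # sparse path: visits only the non-zeros
sparse = np.zeros(n)
np.add.at(sparse, rows, J[rows, cols] * x[cols])

assert np.allclose(dense, sparse)         # same result, ~14x fewer multiplies
```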
If you're hitting performance limits and can't use the GPU path, try switching to integrator="verlet" for steady-state simulations: Verlet is cheaper per step and, for energy-conserving systems, often tolerates a dt about 2× larger than RK4 needs.
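A hypothetical call showing where that option goes; `Simulation`, its constructor, and `run` are stand-in names, and only the `integrator="verlet"` knob comes from the paragraph above.

```python
from molenspin import Simulation   # assumed import path; adjust to your setup

sim = Simulation(bodies=24)        # hypothetical constructor
sim.run(duration=10.0, dt=2e-3,    # ~2x the RK4 timestep, per the note above
        integrator="verlet")
```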
Running the benchmarks yourself
```bash
pip install "molenspin[dev]"
python -m molenspin.bench --sizes 4,12,24,48 --duration 10 --dt 1e-3
```
Results are written to `bench_results.json`. PRs with benchmark results from other hardware are very welcome.