April 3, 2026 · deep-dive · 7 min read

Benchmarking the Rust extension: 8× faster than pure NumPy

When I rewrote the simulation core in Rust for the 0.3.0 release I had a vague expectation it would be "fast". What I didn't have was a systematic benchmark suite. This post fixes that — here are the numbers across four assembly sizes, comparing the original NumPy baseline, the Rust CPU extension, and the new GPU path from 0.4.0.

Test setup

Machine: Apple M3 Max, 64 GB RAM, macOS 15.2. Python 3.12.3, NumPy 2.0.1, PyTorch 2.3.0. Each measurement is the median of 20 runs after a 1-second warmup. Assembly sizes: 4, 12, 24, and 48 bodies. Each run simulates 10 seconds of model time at 1 kHz (dt=1e-3) with the RK4 integrator.
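For anyone reproducing the protocol outside the bench module, here is a minimal sketch of it in plain Python; run_case is a hypothetical stand-in for whatever drives one simulation, not part of molenspin's API.

    import statistics
    import time

    def measure(run_case, runs=20, warmup_seconds=1.0):
        """Median wall time of run_case() in milliseconds, after a warmup."""
        # Warm up for ~1 s so caches, lazy imports, and (for the GPU path)
        # kernel compilation stay out of the timed region.
        deadline = time.perf_counter() + warmup_seconds
        while time.perf_counter() < deadline:
            run_case()

        # Timed runs: the median is robust to the occasional outlier.
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            run_case()
            samples.append((time.perf_counter() - start) * 1e3)
        return statistics.median(samples)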

Results

Assembly size | NumPy (ms) | Rust CPU (ms) | Rust speedup | Rust+GPU (ms) | GPU speedup vs NumPy
4 bodies      | 41         | 8             | 5.1×         | 12            | 3.4×
12 bodies     | 148        | 19            | 7.8×         | 11            | 13.5×
24 bodies     | 491        | 63            | 7.8×         | 18            | 27.3×
48 bodies     | 1840       | 228           | 8.1×         | 83            | 22.2×

A few things stand out. The GPU path is actually slower for 4-body assemblies — the dispatch overhead dominates at that scale. Break-even is around 10 bodies. At 48 bodies the GPU is a 2.7× win over the Rust CPU path. The Rust speedup versus NumPy holds steady at 7.8–8.1× from 12 bodies up (only the tiny 4-body case falls to 5.1×), which suggests the bottleneck is arithmetic throughput rather than memory latency or Python overhead.

Where does the time go?

I profiled the Rust extension with cargo flamegraph on the 24-body case.

The force evaluation is embarrassingly parallel — each body's derivative depends only on its neighbors in the constraint graph. I'm currently computing it serially. Adding Rayon-based parallelism to the RK4 stage evaluation is on the roadmap for 0.5; preliminary numbers suggest a 2–3× improvement on 8-core machines for assemblies of 20+ bodies.
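To make the "embarrassingly parallel" point concrete, here is the shape of that loop sketched in plain Python rather than Rust; the adjacency lists and the pairwise coupling term are illustrative only, not molenspin's actual force law.

    import numpy as np

    def forces(positions, neighbors, k=1.0):
        """Accumulate per-body forces over the constraint graph.

        positions: (n_bodies, dof) float array; neighbors: list of index lists.
        Iteration i reads only body i and its neighbors and writes only row i,
        so the outer loop has no cross-iteration dependencies and could be
        split across threads (Rayon, on the Rust side).
        """
        out = np.zeros_like(positions)
        for i, nbrs in enumerate(neighbors):
            for j in nbrs:
                # Illustrative spring-like coupling between neighbors.
                out[i] += -k * (positions[i] - positions[j])
        return out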

NumPy isn't slow here for the reason you might think

The NumPy baseline isn't slow because of Python dispatch overhead per timestep — it's slow because the assembly-wide state vector is sparse and NumPy operates on dense arrays. At each RK4 stage, the force evaluation function touches maybe 6–8 non-zero entries in a 96-element Jacobian (for 48 bodies × 2 DOF). NumPy computes all 96 elements anyway. The Rust extension uses a sparse representation internally and skips the zeros. That's almost the entire 8× gap.
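A toy dense-versus-sparse comparison makes that gap plausible. The matrix below, its ~7 non-zeros per row, and the random values are illustrative stand-ins, not molenspin's actual state layout; the point is only that the dense path multiplies every entry while the sparse path touches nothing but the stored triplets.

    import numpy as np

    n = 96                                   # 48 bodies x 2 DOF
    rng = np.random.default_rng(0)

    # Mostly-zero coupling matrix: ~7 non-zero entries per row.
    J = np.zeros((n, n))
    for i in range(n):
        cols = rng.choice(n, size=7, replace=False)
        J[i, cols] = rng.standard_normal(7)
    x = rng.standard_normal(n)

    # Dense path (the NumPy baseline): multiplies every entry, zeros included.
    f_dense = J @ x

    # Sparse path (what the Rust core does, conceptually): keep only
    # (row, col, value) triplets and skip everything else.
    rows, cols = np.nonzero(J)
    vals = J[rows, cols]
    f_sparse = np.zeros(n)
    np.add.at(f_sparse, rows, vals * x[cols])

    assert np.allclose(f_dense, f_sparse)    # same result, ~14x fewer multiplies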

💡 Tip: if you're hitting performance limits and can't use the GPU path, try switching to integrator="verlet" for steady-state simulations. Verlet is cheaper per step and often allows a 2× larger timestep for energy-conserving systems.
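For context on the "cheaper per step" part: a velocity-Verlet step needs a single new force evaluation, versus four derivative evaluations per RK4 step. A generic sketch, not molenspin's integrator code:

    def verlet_step(x, v, a, accel, dt):
        """One velocity-Verlet step; accel(x) returns accelerations."""
        x_new = x + v * dt + 0.5 * a * dt**2
        a_new = accel(x_new)                 # the only new force evaluation
        v_new = v + 0.5 * (a + a_new) * dt
        return x_new, v_new, a_new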

Running the benchmarks yourself

pip install "molenspin[dev]"
python -m molenspin.bench --sizes 4,12,24,48 --duration 10 --dt 1e-3

Results are written to bench_results.json. PRs with benchmark results from other hardware are very welcome.
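If you want to eyeball a run before opening a PR, the file is plain JSON, so something like this works; the snippet assumes nothing about the schema beyond that.

    import json

    with open("bench_results.json") as f:
        print(json.dumps(json.load(f), indent=2))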