Newton VBD — launch block_dim sweep

Soft (cloth) + rigid (cable) paths · NVIDIA L40 (sm_89, 142 SMs) · Warp 1.13.0rc1 · CUDA-graph timed, densest/settled frame · 2026-07-02

TL;DR — recommended block_dim (vs Warp default 256; detection/tile default 16)

Method

Each kernel's real launch is recorded during one solver.step() (reversible wp.launch monkeypatch), then replayed N-in-a-CUDA-graph across block_dim ∈ {8…1024} (warmup → 100-pass graph → replay >1.5 s → ScopedTimer(sync), per-pass mean). Tile kernel swept via a tile_size factory (block_dim ≡ tile size). End-to-end times full solver.step(), default vs tuned. Soft = 4 cloth datasets at densest self-contact frame; rigid = 4 cable examples instantiated headless, settled 40 frames. High block_dim that won't compile/launch is caught & skipped.

Soft path (cloth) — per-kernel optima

Per-pass µs at the optimal block_dim; speedup vs current default. green ≥1.15× · yellow ≥1.05×.

kerneldefaultpoker_cardstwistfrankarollersoverall rec
forward_step25664
1.8µs · 1.05×
128
1.9µs · 1.04×
128
1.9µs · 1.01×
128
1.9µs · 1.01×
128
accumulate_self_contact_force_and_hessian256128
11.1µs · 1.01×
256
38.6µs · 1.00×
256
18.1µs · 1.00×
256
57.6µs · 1.00×
256
vertex_triangle_collision_detection_kernel168
63.3µs · 1.42×
8
194.5µs · 1.37×
8
108.2µs · 1.49×
8
180.3µs · 1.37×
8
edge_colliding_edges_detection_kernel168
166.0µs · 1.18×
8
503.8µs · 1.25×
8
618.2µs · 1.22×
8
756.1µs · 1.15×
8
update_velocity25664
1.5µs · 1.06×
128
1.5µs · 1.05×
128
1.6µs · 1.01×
128
1.6µs · 1.01×
128
apply_truncation_ts25664
1.5µs · 1.07×
128
1.5µs · 1.05×
128
1.6µs · 1.01×
128
1.6µs · 1.01×
128
apply_planar_truncation_parallel_by_collision256256
10.2µs · 1.00×
128
23.2µs · 1.02×
32
11.0µs · 1.05×
256
39.1µs · 1.00×
32
solve_elasticity_tile1632
6.8µs · 1.02×
32
7.5µs · 1.02×
32
7.7µs · 1.20×
32
12.6µs · 1.03×
32

Soft end-to-end (full step)

exampledefault µs/stepoverall-rec µs (×)per-example-opt (×)
franka3508.63162.8 (1.11×)3176.8 (1.10×)
poker_cards1671.81515.6 (1.10×)1520.6 (1.10×)
rollers7142.76732.8 (1.06×)6764.1 (1.06×)
twist4265.03813.3 (1.12×)3792.2 (1.12×)
2026-07-01-per-kernel.png 2026-07-01-endtoend.png

Rigid path (cables) — per-kernel optima

kerneldefaulttwistpilebundle_hysteresisy_junctionoverall rec
forward_step_rigid_bodies2568
2.1µs · 1.07×
64
2.2µs · 1.51×
8
2.1µs · 1.46×
32
2.3µs · 1.03×
64
accumulate_body_body_contacts_per_body2568
6.1µs · 1.83×
8
7.7µs · 1.34×
8
19.4µs · 1.60×
8
12.3µs · 1.05×
8
solve_rigid_body2568
10.6µs · 1.04×
16
12.2µs · 1.30×
8
9.6µs · 1.20×
8
15.0µs · 1.02×
8
update_duals_body_body_contacts2568
3.7µs · 1.07×
64
4.3µs · 1.16×
32
4.0µs · 1.01×
256
3.5µs · 1.00×
64
update_duals_joint2568
3.5µs · 1.05×
32
3.8µs · 1.09×
8
3.5µs · 1.09×
128
3.6µs · 1.00×
32
update_body_velocity25632
3.8µs · 1.08×
128
4.3µs · 1.25×
8
5.9µs · 1.40×
256
5.5µs · 1.00×
8
update_cable_dahl_state25616
3.7µs · 1.06×
16

Rigid end-to-end (full step)

exampledefault µs/stepoverall-rec µs (×)per-example-opt (×)
twist343.6290.3 (1.18×)289.9 (1.19×)
pile372.7284.7 (1.31×)291.3 (1.28×)
bundle_hysteresis506.1369.8 (1.37×)374.2 (1.35×)
y_junction333.8309.6 (1.08×)309.0 (1.08×)
2026-07-02-rigid-per-kernel.png 2026-07-02-rigid-endtoend.png

Analysis — why do some kernels want such a small block_dim?

A block of 8 threads = one warp with only 8/32 lanes active (¾ wasted). It still wins — for three different reasons. All numbers below are measured from the compiled SASS (cuobjdump), since Nsight Compute couldn't run here (see note).

Detection → bd=8

Carries a per-thread BVH stack in shared memory = 132 B/thread, so shared/block = 132×block_dim. Big blocks starve occupancy; it's memory-latency-bound, so more resident warps = faster. Divergent traversal → full warps can't use the SIMD width anyway, so the wasted lanes cost ~nothing.

accumulate_self_contact → bd=256

Only 64 B fixed shared + coherent arithmetic (every thread same math). Occupancy is block_dim-independent and full 32-lane warps give full SIMD throughput → big blocks win. Same solver, opposite optimum — proves the mechanism.

Rigid kernels → bd=8–64

No shared stack — just tiny grids. solve_rigid_body launches ~50–100 items → 1 block at bd=256 → only 1 of 142 SMs. Small blocks spread work across more SMs. (Optimum shifts with scene size, unlike detection.)

Measured resource usage (sm_89 cubin)

kerneloptimumregisters/threadshared/blockexecution
vertex_triangle detection870132 × block_dimdivergent (BVH)
edge_colliding detection8105132 × block_dimdivergent (BVH)
accumulate_self_contact25616064 (fixed)coherent
forward_step~1283064 (fixed)coherent
update_velocity~1282364 (fixed)coherent

132 B/thread × 512 = 67584 B (0x10800) > 48 KB → exactly the ptxas wall that fails detection at bd≥512.

Detection occupancy vs block_dim (theoretical, 48–64 KB shared carveout)

block_dimresident warps/SMwarp occupancylimited by
82450%24-block/SM cap (shared negligible)
3211–1522–31%shared memory
256816%shared (one block barely fits)

Registers (70/105) cap occupancy at ~28 warps regardless of block_dim → not the differentiator; shared memory is. Small block_dim → 2–3× more resident warps → ≈ the 2× measured speedup.

Nsight note. Per-kernel counters (achieved occupancy, DRAM/L2 throughput, warp-stall reasons) come from Nsight Compute (ncu), not Systems/nsys. ncu 2024.3.2 is installed but couldn't profile here: GPU perf counters need admin (sudo clears that), then ncu fails to instrument the CUDA-12.9 kernels (failed to prepare kernel — version gap). So resource usage came from cuobjdump (static SASS) + an occupancy model + the timing sweep. A CUDA-12.9-matched ncu would let us quantify the divergence-vs-latency split exactly.
Takeaway. Small-block wins are a symptom of under-utilization (detection at bd=8 = 50% warp occupancy but only ~12% thread occupancy). The deeper fix is algorithmic — stackless/less-divergent BVH traversal, or more parallelism per rigid body — which would let these run full warps at high occupancy and beat any block_dim. Out of scope here (launch-dimension tuning only); flagged as follow-up.

Follow-ups (not covered)

Generated from results/2026-07-01-isolated.json, 2026-07-01-endtoend.json, 2026-07-02-rigid-isolated.json, 2026-07-02-rigid-endtoend.json + cuobjdump SASS. Full write-up: 2026-07-01-report.md. Harness: scripts/.