A collection of parallel and performant algorithms (in robotics)

There was a paper some time ago that presents a really fast motion planner, using amongst other things SIMD programming for collision checking of edges (which is usually the most time intensive step in motion planning for robots). This got me more into SIMD programming, and hoarding a handful of links related to high-performing implementations of common robotics algorithms. This post is me closing a bunch of tabs and archiving their content here in case I'll need it again.

The paper/preprint in question is the one here by Zak and Wil. They present a really fast approach for motion planning, and have open sourced its implementation here. One of the things that they leverage is the parallelization of collision checking, which they call fine-grained collision checking. Compared to ‘traditional’ parallelization over multiple cores, they do parallel collision checking using vectorization with SIMD.

Since they took a while to open-source their code, I made some experiments myself, which is in very rough shape, but can be found on github. This shows a ~10x speedup on 2D collision checking (albeit not demoed in a motion planning setting). There will be a full post on this at some point, and I’ll hopefully also clean the mess there up a bit.

In the rest of the note, I want to organize a few links on SIMD blog-posts, posts on how to achieve high performance implementations of various algorithms (matmul for example), and a large part of robotics papers/algorithms.

There are likely many, many more very relevant and very interesting posts/papers and other forms of writeups. These are just the ones I stumbled across.

A while ago there was a discussion on hackernews of an old (2011) post on SIMD vs SIMT vs SMT. However, the consensus seems to be that quite a lot of things have changed since then, and the differences between the different programming models are not as distinguished anymore since the three different approaches borrowed many things from each other.
Here’s a great walkthrough of transitioning an algorithm implemented normally to SIMD-code. There is a nice intro to what SIMD programming actually is, and why it is faster than ‘normal’ code.
Here’s an overview of how teaching of parallel programming is done by Brian Plancher, a robotics professor at Columbia.
This list would not be complete without the intel intrinsic guide or agner fogs optimization resources which are both crucial resources for anyone trying to get started with SIMD programming (or code optimizations in general).
I also found this simd tutorial useful.
Finally, here is a great writeup of how to achieve fast matmul in cpp on a cpu.
And here is something similar for optimizing a kernel for matmul on a GPU.

Planning: Sampling based and trajectory optimization

Clayton W. Ramsey, Wil and Zak had another paper on cache/parallelization friendly nearest neighbor structures. There is also a blog post on its contents.
Brian Placher (the same as above) and gang have a paper on a parallel implementation of DDP. It is parallel on various granularities, i.e. on an algorithmic level (leveraging gauss newton multiple shooting), and also on an implementation level. Their conclusion is that the parallelizations they analyzed can help achieve faster convergence in some cases, but how well it works i sheavily dependent on the problem.
The paper here proposes a parallel version of something similar to rrt, and achieves fast planing times. The paper is from 2017, and the paper from Wil and Zak proposes a much faster planner.
cuRobo from nvidia is a collection of various things to achieve good motion planning times. The report is quite interesting to read. At the time of writing, cuRobo does not work for more than two robots, and is only doing motion planning.
Motion planing templates is a paper suggesting templated motion planning in order to be able to specialize to specific robot geometries at compile time. It is from Brian Ichnowski who is generally doing cool things in robotics.

Control

Gauss Newton multiple shooting is a combination of dynamic programming and multiple shooting, which allows parallelization of the algorithm. This is the paper that introduces the algorithm level parallelization used in the DDP paper above.
MPPI is a method that relies on sampling noise, using it as control input to a dynamic system and rolling it out over a certain prediction-horizon. Here is a collection of research contributions from Gatech.
MPPI is implemented on a GPU for robot locomotion, or for aggressive driving
Accelerating Robot Dynamics Gradients on a CPU, GPU, and FPGA
Nvidia has more work on implementing things on GPUs, here with the example of running MPPI on a gpu for manipulation, called STORM.
Tinympc is an MPC solver for embedded systems, that also has codegeneration capabilities.
Real time iterations with MPC is a core contribution that shows how to apply (nonlinear) MPC in reality, by only doing a few iterations at a time, and therefore guaranteeing compute times of the control algorithm.

Optimizers

OSQP is one of the most used solvers for quadratic programs.
Clarabel is new, implemented in rust, being able to deal with cones.
Crocoddyl is a formulation of DDP for contact-switched systems.
Altro is a fast solver for constrained traj. opt.
The relevance of scaling in your formulation of an optimization problem, and something similar in robotics.
On the relevance of sparsity in MPC transcriptions, and a second time.

I am interested in everything performace related similar to the things above. If you think you have something that is relevant to the list, please let me know!

Generic cool performance related literature

Robotics related literature (papers)