I took a look at the code, and read the paper. It seems that they directly calculate the entire 2D DP array, but use SIMD to allow each cell to contain multiple values, one for each query string. My approach uses anti-diagonals instead, but it is fast for one vs one comparisons, instead of handling multiple query strings.
Regardless, my goal was to learn some SIMD and Rust (first time for both), so I did not read many background papers.
One thing to keep in mind is for SIMD memory locality is very important; a diagonal vector with a standard 2D DP grid is gonna lead to a lot of cache misses. Just something else to learn about.
Since I am storing the entire DP matrix as diagonal vectors that are flattened, I don't think there will be many cache misses. Each diagonal only depends on its previous two diagonals, and each diagonal is stored contiguously in memory.
The problem with handling diagonals is that indexing cells and comparing characters on the diagonal becomes complex. Dealing with this without many branches (less branch mispredictions) is the hard part.
https://github.com/langmead-lab/vargas