That is, basically: you just rotate and use the 4-bit centroids, since the post-rotation distribution is known, so you don't need min/max scaling. Notably, once you have that, you can do the dot product with a 256-element lookup table, since the two vectors share the same scale. The important point here is that for this use case it is NOT worth using the 1-bit residual: for vector-x-quant dot products you have a fast path, but for quant-x-quant you don't, and the recall difference is small anyway. On top of that, remember that modern learned embeddings tend to use all the components fairly evenly, so you gain some recall for sure, but not as much as in the KV cache case. So the wins are:
- Slightly improved recall
- Faster index creation
- Online addition of vectors without recalibrating the index
The last point in particular is a big infrastructure win, I think.
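A minimal sketch of the 256-entry lookup-table trick mentioned above. The uniform codebook here is illustrative only; the real one would be the Lloyd-Max centroids for the post-rotation (approximately Gaussian) distribution. The point is structural: because both vectors are quantized against the SAME fixed codebook, every pairwise centroid product can be precomputed once.

```python
import numpy as np

# Illustrative 16-level (4-bit) codebook; a real system would use the
# Lloyd-Max centroids for a unit Gaussian instead of a uniform grid.
centroids = np.linspace(-2.5, 2.5, 16)

# 16 x 16 = 256 precomputed centroid products.
lut = np.outer(centroids, centroids)  # lut[i, j] = centroids[i] * centroids[j]

def quantize(v):
    """Map each component to the index of its nearest centroid."""
    return np.abs(v[:, None] - centroids[None, :]).argmin(axis=1)

def lut_dot(codes_a, codes_b):
    """Approximate dot product of two quantized vectors via table lookups."""
    return lut[codes_a, codes_b].sum()

rng = np.random.default_rng(0)
a, b = rng.standard_normal(128), rng.standard_normal(128)
approx = lut_dot(quantize(a), quantize(b))
exact = float(a @ b)
```

In a real implementation the per-component lookups would be done with SIMD shuffles over the packed codes, but the table is the same 256 entries.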
Cool to see the same WHT + Lloyd-Max math applied to vector search. The data-oblivious codebook property is exactly what makes it work for online KV cache compression too. No calibration, no training, just quantize and go.
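A rough sketch of what "data-oblivious" means in practice, under my own assumptions (randomized Walsh-Hadamard rotation with a shared sign diagonal; function names are mine, not the paper's): after the rotation, every component is approximately Gaussian regardless of the input distribution, so one fixed codebook works for every vector and no calibration pass is needed.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform, O(d log d). len(x) must be a power of 2."""
    x = x.copy()
    d = len(x)
    h = 1
    while h < d:
        for i in range(0, d, h * 2):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(d)  # orthonormal scaling makes it its own inverse

d = 1024
rng = np.random.default_rng(0)
signs = rng.choice([-1.0, 1.0], size=d)  # random diagonal, shared by ALL vectors

def rotate(v):
    """HD rotation: spreads energy so components look i.i.d. Gaussian."""
    return fwht(signs * v)

# A vector with very skewed energy: almost everything in the first 8 components.
v = rng.standard_normal(d) * np.array([5.0] * 8 + [0.1] * (d - 8))
r = rotate(v)
# The norm is preserved (orthogonal transform), so distances survive,
# and the fixed Gaussian codebook can now be applied per component.
```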
If anyone is running local LLMs and wants to try it: https://github.com/TheTom/turboquant_plus/blob/main/docs/get...
The repo reproduces the benchmarks from Section 4.4 of the paper — recall@1@k on GloVe (d=200) and OpenAI embeddings (d=1536, d=3072). At 4-bit on d=1536, you get 0.967 recall@1@1 with 8x compression. At 2-bit, 0.862 recall@1@1 with ~16x compression.
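A quick sanity check on the quoted ratios against an fp32 baseline; this ignores any per-vector metadata, which is presumably where the "~" in ~16x comes from.

```python
# Compression ratio relative to fp32 (32 bits per component).
# Ignores per-vector overhead (scales, norms), hence "~16x" above.
def compression_ratio(bits_per_dim: int, base_bits: int = 32) -> float:
    return base_bits / bits_per_dim

print(compression_ratio(4))  # 4-bit codes -> 8.0x
print(compression_ratio(2))  # 2-bit codes -> 16.0x
```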