Feb 28, 2026 · RAG & Retrieval

Why Is Vector Search So Fast? HNSW and IVF-PQ Explained With the Math

A walkthrough of approximate nearest neighbor search covering HNSW graphs, inverted file indexes, product quantization, and IVF-PQ with worked examples and memory analysis.


Abstract

A comparison of exact and approximate nearest neighbor search methods, with emphasis on the algorithmic and mathematical foundations of Hierarchical Navigable Small World graphs (HNSW) and the Inverted File Index with Product Quantization (IVF-PQ). This post formalizes the approximate nearest neighbor problem, derives complexity bounds, analyzes memory footprints, and provides worked numerical examples on datasets of practical scale. The goal is to make precise why these methods achieve sub-linear query times and orders-of-magnitude memory reductions relative to brute-force search.

1. The Nearest Neighbor Problem

Let $X = \{x_1, x_2, \dots, x_N\}$ be a dataset of $N$ vectors in $\mathbb{R}^d$, and let $d(\cdot, \cdot)$ be a distance function (typically Euclidean $L_2$ or inner product). Given a query vector $q \in \mathbb{R}^d$ and an integer $k \geq 1$, the exact $k$-nearest neighbor problem asks for the set

$$S^* = \underset{S \subseteq X,\; |S|=k}{\operatorname{arg\,min}} \; \max_{x \in S}\; d(q, x)$$

that is, the $k$ vectors in $X$ closest to $q$ under $d$.

The approximate variant relaxes this: return a set $\hat{S}$ such that, with high probability, for every $\hat{x} \in \hat{S}$ and corresponding true neighbor $x^* \in S^*$,

$$d(q, \hat{x}) \leq (1 + \epsilon)\, d(q, x^*)$$

for some approximation factor $\epsilon > 0$. In practice the community measures quality via recall@$k$, the fraction of true top-$k$ neighbors returned by the approximate algorithm.
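The recall@$k$ metric is a direct comparison of two id lists. A minimal Python sketch; the id values below are hypothetical, chosen for illustration:

```python
# Recall@k: the fraction of the true top-k neighbors that the
# approximate search returned.
def recall_at_k(true_ids, approx_ids, k):
    true_set = set(true_ids[:k])
    return sum(1 for i in approx_ids[:k] if i in true_set) / k

# 3 of the true top-4 recovered:
r = recall_at_k([1, 2, 3, 4], [1, 2, 9, 4], k=4)   # 0.75
```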

2. Brute Force Baseline

The naive approach computes $d(q, x_i)$ for every $i \in \{1, \dots, N\}$ and selects the $k$ smallest distances.

Time complexity. Each distance computation in $\mathbb{R}^d$ costs $O(d)$, so the total is

$$T_{\text{brute}} = O(Nd)$$

Memory. Storing the full dataset in float32 requires

$$M_{\text{brute}} = 4Nd \;\text{bytes}$$

Worked example. For $N = 10^6$ vectors of dimension $d = 768$:

  • Distance computations per query: $10^6 \times 768 \approx 7.7 \times 10^8$ FLOPs.
  • Memory: $4 \times 10^6 \times 768 = 3.072 \times 10^9$ bytes $\approx 2.86$ GB.

At a throughput of roughly $10^{10}$ FLOPs/s on a modern CPU core, this yields latencies on the order of 77 ms per query — tolerable for small workloads, but untenable at scale. The two families of methods described next reduce either the number of distance computations, the cost per computation, or both.
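The baseline fits in a few lines of NumPy. A minimal sketch with toy sizes; the function name and dimensions are illustrative, not from any particular library:

```python
import numpy as np

# Brute-force k-NN under squared L2: O(Nd) work per query.
def brute_force_knn(X, q, k):
    # ||x - q||^2 = ||x||^2 - 2 x.q + ||q||^2; the ||q||^2 term is the
    # same for every x, so it can be dropped for ranking purposes.
    d2 = (X * X).sum(axis=1) - 2.0 * (X @ q)
    idx = np.argpartition(d2, k)[:k]           # unordered top-k in O(N)
    return idx[np.argsort(d2[idx])]            # sort only the k winners

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 32))
q = rng.standard_normal(32)
top = brute_force_knn(X, q, 5)
```

`argpartition` avoids a full $O(N \log N)$ sort: only the $k$ winners are ordered.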

3. HNSW (Hierarchical Navigable Small World)

[Figure: HNSW hierarchical layer structure]

3.1 Navigable Small-World Graphs

A navigable small-world (NSW) graph places every vector $x_i$ as a node and connects each to a set of neighbors such that greedy routing converges quickly. The key structural property is that the graph possesses both short-range links (connecting nearby vectors) and long-range links (enabling logarithmic-hop traversal).

HNSW, introduced by Malkov and Yashunin (2018), layers multiple NSW graphs in a hierarchy. Layers are numbered $0, 1, \dots, L$. Layer 0 contains all $N$ nodes; higher layers contain exponentially fewer nodes.

3.2 Layer Assignment

When inserting element $x_i$, its maximum layer $\ell_i$ is drawn as

$$\ell_i = \lfloor -\ln(\text{uniform}(0,1)) \cdot m_L \rfloor$$

where $m_L$ is a normalization constant, typically set to $m_L = 1/\ln(M)$, with $M$ being the maximum number of connections per node. The probability that a node appears at layer $\ell$ or above is

$$P(\ell_i \geq \ell) = \exp\!\left(-\frac{\ell}{m_L}\right)$$

Setting $m_L = 1/\ln(M)$ gives $P(\ell_i \geq \ell) = M^{-\ell}$, so the expected number of nodes at layer $\ell$ is

$$\mathbb{E}[|\text{layer } \ell|] = N \cdot M^{-\ell}$$

so the expected number of layers is

$$L = O\!\left(\frac{\ln N}{\ln M}\right) = O(\log_M N)$$
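The layer-assignment rule is easy to sanity-check numerically. A Monte Carlo sketch with $M = 16$, confirming that a node reaches layer $\geq 1$ with probability $\approx 1/M$:

```python
import numpy as np

# HNSW layer assignment: l = floor(-ln(U) * mL) with mL = 1/ln(M).
# With M = 16, a node reaches layer >= 1 with probability 1/M = 0.0625.
M = 16
mL = 1.0 / np.log(M)
rng = np.random.default_rng(42)
levels = np.floor(-np.log(rng.random(200_000)) * mL).astype(int)

frac_above_0 = float(np.mean(levels >= 1))   # should be close to 1/16
max_level = int(levels.max())                # grows like log_M N
```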

3.3 Greedy Search

Search begins at the top layer from a fixed entry point. At each layer, a greedy procedure expands a beam (priority queue) of width $ef$ — called efSearch — by iterating over each candidate's neighbors and keeping the closest $ef$ vectors found so far. When no improvement is possible at layer $\ell$, the current best candidate descends to layer $\ell - 1$. (In the reference algorithm, the upper layers are traversed with a beam of width 1; the full width $ef$ is used only at layer 0.) At layer 0, the procedure terminates and returns the top $k$ vectors from the beam.
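The per-layer routine can be sketched as a beam search over an adjacency list. A simplified single-layer version; the path graph and 1-D points below are toy values for illustration:

```python
import heapq
import numpy as np

# Greedy beam search over one NSW layer: keep the `ef` closest vectors
# seen so far, always expanding the closest unexplored candidate.
def search_layer(vectors, graph, q, entry, ef):
    dist = lambda i: float(((vectors[i] - q) ** 2).sum())
    visited = {entry}
    frontier = [(dist(entry), entry)]      # min-heap of unexplored nodes
    beam = [(-dist(entry), entry)]         # max-heap (negated), size <= ef
    while frontier:
        d_c, c = heapq.heappop(frontier)
        if d_c > -beam[0][0]:
            break                          # frontier worse than beam's worst
        for nb in graph[c]:
            if nb not in visited:
                visited.add(nb)
                d_n = dist(nb)
                if len(beam) < ef or d_n < -beam[0][0]:
                    heapq.heappush(frontier, (d_n, nb))
                    heapq.heappush(beam, (-d_n, nb))
                    if len(beam) > ef:
                        heapq.heappop(beam)    # evict current worst
    return sorted((-d, i) for d, i in beam)    # (distance, id), closest first

# toy demo: 1-D points 0..9 chained in a path graph, query near 7.3
vecs = np.arange(10, dtype=float).reshape(10, 1)
g = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
found = search_layer(vecs, g, np.array([7.3]), entry=0, ef=3)
```

Starting from node 0, the routine walks the chain toward the query and ends up holding the three closest points (7, 8, 6).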

3.4 Complexity Analysis

At each layer, greedy search visits $O(ef \cdot M)$ nodes (expanding $ef$ candidates, each with $M$ neighbors). Across $L = O(\log_M N)$ layers, the total number of distance computations is

$$T_{\text{HNSW}} = O\!\left(ef \cdot M \cdot \log_M N\right) = O\!\left(ef \cdot M \cdot \frac{\ln N}{\ln M}\right)$$

For fixed $ef$ and $M$, this is $O(\log N)$ — exponentially better than brute force.

3.5 Recall–Latency Tradeoff

The parameter $ef$ directly governs the recall–latency tradeoff. Larger $ef$ means more candidates are evaluated, increasing recall at the cost of more distance computations. Empirically, on 1M-scale datasets with $d = 768$ and $M = 16$:

| $ef$ | Recall@10 | Relative latency |
|------|-----------|------------------|
| 16   | ~0.85     | 1×               |
| 64   | ~0.95     | 3.5×             |
| 256  | ~0.99     | 12×              |

Even at $ef = 256$, total distance computations are on the order of $256 \times 16 \times \log_{16}(10^6) \approx 256 \times 16 \times 5 \approx 20{,}480$, roughly 50× fewer than the $10^6$ required by brute force.

3.6 Memory

HNSW stores the full-precision vectors plus the graph. Graph overhead per node is approximately $M \cdot L_{\text{avg}} \cdot 4$ bytes (storing neighbor IDs as int32). With $M=16$ and $L_{\text{avg}} \approx 1.3$ layers per node on average, this is roughly 83 bytes per vector — modest compared to the vector storage itself. Total memory for the worked example:

$$M_{\text{HNSW}} \approx 4Nd + N \cdot M \cdot L_{\text{avg}} \cdot 4 \approx 2.86\;\text{GB} + 83\;\text{MB} \approx 2.94\;\text{GB}$$

HNSW reduces *compute* dramatically but provides little *memory* savings over brute force. This motivates quantization-based approaches.

4. IVF (Inverted File Index)

4.1 Voronoi Partitioning

The Inverted File Index partitions the dataset $X$ into $n_{\text{list}}$ clusters via $k$-means (or a similar algorithm). Let $\{c_1, c_2, \dots, c_{n_{\text{list}}}\}$ be the centroids. Each vector $x_i$ is assigned to its nearest centroid:

$$\sigma(x_i) = \underset{j \in \{1,\dots,n_{\text{list}}\}}{\operatorname{arg\,min}}\; d(x_i, c_j)$$

This induces a Voronoi partition of $\mathbb{R}^d$. Each cell $V_j = \{x_i : \sigma(x_i) = j\}$ stores an inverted list of its members.

4.2 Query Procedure

At query time, the algorithm first computes $d(q, c_j)$ for all $j$ and selects the $n_{\text{probe}}$ closest centroids. It then performs exhaustive search only within those $n_{\text{probe}}$ inverted lists:

$$T_{\text{IVF}} = O\!\left(n_{\text{list}} \cdot d + n_{\text{probe}} \cdot \frac{N}{n_{\text{list}}} \cdot d\right)$$

The first term is the centroid comparison cost; the second is the exhaustive scan within the selected cells. When $n_{\text{list}} = \sqrt{N}$, the second term becomes $O(n_{\text{probe}} \cdot \sqrt{N} \cdot d)$.
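The partition-then-probe procedure can be sketched in NumPy. The k-means trainer here is a deliberately plain stand-in for a tuned implementation, and all sizes are toy values:

```python
import numpy as np

# IVF sketch: learn centroids, build inverted lists, then probe only
# the n_probe cells nearest the query.
def kmeans(X, n_list, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), n_list, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(n_list):
            members = X[assign == j]
            if len(members):
                C[j] = members.mean(axis=0)   # recenter non-empty cells
    return C

def ivf_search(X, C, inv_lists, q, k, n_probe):
    cell_d = ((C - q) ** 2).sum(axis=1)                 # coarse: n_list * d work
    probed = np.argsort(cell_d)[:n_probe]
    ids = np.concatenate([inv_lists[j] for j in probed])
    d = ((X[ids] - q) ** 2).sum(axis=1)                 # fine: probed cells only
    return ids[np.argsort(d)[:k]]

rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 16))
C = kmeans(X, n_list=32)
assign = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
inv_lists = [np.flatnonzero(assign == j) for j in range(32)]
```

With `n_probe` equal to `n_list` the search degenerates to exact brute force, which makes the sketch easy to validate.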

4.3 Recall as a Function of $n_{\text{probe}}/n_{\text{list}}$

Under the simplifying assumption that true nearest neighbors are uniformly distributed across Voronoi cells, the probability that a given true neighbor falls within the probed cells is approximately

$$P(\text{hit}) \approx \frac{n_{\text{probe}}}{n_{\text{list}}}$$

For $k$ independent neighbors, the expected recall is

$$\mathbb{E}[\text{recall@}k] \approx \frac{n_{\text{probe}}}{n_{\text{list}}}$$

In practice, true neighbors are not uniformly distributed: they concentrate in the cells closest to the query, so observed recall is substantially higher than this pessimistic estimate. Typical configurations use $n_{\text{list}} = 1024$ to $4096$ and $n_{\text{probe}} = 8$ to $64$, achieving recall@10 above 0.90 while scanning fewer than 5% of the dataset.

4.4 Worked Example

With $N = 10^6$, $d = 768$, $n_{\text{list}} = 1024$, $n_{\text{probe}} = 16$:

  • Vectors per cell: $10^6 / 1024 \approx 977$.
  • Vectors scanned: $16 \times 977 = 15{,}632$.
  • Distance computations: $1024 \times 768 + 15{,}632 \times 768 \approx 12.8 \times 10^6$.
  • Speedup vs. brute force: $7.7 \times 10^8 / 1.28 \times 10^7 \approx 60\times$.

Memory remains $O(Nd)$ since the raw vectors are still stored.

5. Product Quantization

5.1 Subspace Decomposition

Product Quantization (PQ), introduced by Jégou et al. (2011), compresses each $d$-dimensional vector into a short code. The vector space $\mathbb{R}^d$ is decomposed into $M$ disjoint subspaces of dimension $d/M$ each:

$$x = [x^{(1)}, x^{(2)}, \dots, x^{(M)}], \quad x^{(m)} \in \mathbb{R}^{d/M}$$

5.2 Codebook Learning

For each subspace $m$, a codebook $C^{(m)} = \{c^{(m)}_1, \dots, c^{(m)}_{k^*}\}$ of $k^*$ centroids is learned via $k$-means on the projected sub-vectors $\{x_i^{(m)}\}_{i=1}^N$. Each sub-vector is then replaced by its nearest centroid index:

$$q^{(m)}(x) = \underset{j \in \{1,\dots,k^*\}}{\operatorname{arg\,min}}\; \|x^{(m)} - c^{(m)}_j\|^2$$

The PQ code for vector $x$ is the tuple of indices $(q^{(1)}(x), \dots, q^{(M)}(x))$.
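Codebook learning and encoding can be sketched as follows; the mini k-means trainer and the toy sizes are illustrative stand-ins for a production trainer:

```python
import numpy as np

# Product quantization sketch: split d dimensions into M subspaces,
# learn a k*-entry codebook per subspace, then encode each sub-vector
# as a 1-byte centroid index (k* <= 256).
def train_pq(X, M, k_star=256, iters=5, seed=0):
    N, d = X.shape
    ds = d // M                                     # subspace dimension
    rng = np.random.default_rng(seed)
    books = np.empty((M, k_star, ds))
    for m in range(M):
        sub = X[:, m * ds:(m + 1) * ds]
        C = sub[rng.choice(N, k_star, replace=False)]
        for _ in range(iters):
            a = np.argmin(((sub[:, None] - C[None]) ** 2).sum(-1), axis=1)
            for j in range(k_star):
                pts = sub[a == j]
                if len(pts):
                    C[j] = pts.mean(axis=0)
        books[m] = C
    return books

def pq_encode(X, books):
    M, k_star, ds = books.shape
    codes = np.empty((len(X), M), dtype=np.uint8)
    for m in range(M):
        sub = X[:, m * ds:(m + 1) * ds]
        codes[:, m] = np.argmin(((sub[:, None] - books[m][None]) ** 2).sum(-1), axis=1)
    return codes

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 32))
books = train_pq(X, M=8, k_star=256, iters=3)
codes = pq_encode(X, books)        # shape (1000, 8): 8 bytes per vector
```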

5.3 Memory Per Vector

Each index requires $\lceil \log_2 k^* \rceil$ bits. With the standard choice $k^* = 256$, each index is exactly 1 byte. The compressed representation of a single vector requires

$$M_{\text{PQ}} = M \cdot \frac{\log_2 k^*}{8} = M \;\text{bytes} \quad (\text{when } k^* = 256)$$

5.4 Worked Example: Memory Savings

For $N = 10^6$, $d = 768$, $M = 96$ (subspace dimension 8), $k^* = 256$:

  • Raw storage: $4 \times 10^6 \times 768 \approx 3.07 \times 10^9$ bytes $\approx 2.86$ GB.
  • PQ storage: $96 \times 10^6 = 96$ MB.
  • Compression ratio: 32× (3072 bytes per vector down to 96).

5.5 Asymmetric Distance Computation (ADC)

At query time, the distance from query $q$ to a database vector $x$ (represented by its PQ code) is approximated via asymmetric distance computation: the query stays in full precision while only the database side is quantized. The key insight: precompute a lookup table of sub-vector distances.

For each subspace $m$ and each centroid $j$, compute

$$D^{(m)}_j = \|q^{(m)} - c^{(m)}_j\|^2$$

This produces $M \times k^*$ entries — a table of shape $(M, 256)$. The approximate squared distance from $q$ to $x$ is then

$$\hat{d}(q, x)^2 = \sum_{m=1}^{M} D^{(m)}_{q^{(m)}(x)}$$

Each distance requires only $M$ table lookups and additions — no floating-point multiplications.

Lookup table construction cost: $M \times k^* \times (d/M) = k^* \times d$ FLOPs — independent of $N$.

Per-vector distance cost: $M$ lookups + $M-1$ additions $= O(M)$.

For the worked example ($M = 96$):

  • Table construction: $256 \times 768 = 196{,}608$ FLOPs (one-time per query).
  • Per-vector distance: 96 lookups.
  • Total for scanning all 1M vectors: $96 \times 10^6 = 9.6 \times 10^7$ operations.
  • Compare to brute force: $7.7 \times 10^8$ FLOPs — an 8× reduction in compute, on top of the roughly 30× memory reduction.
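The table-then-lookup procedure is a few lines of NumPy. Here `books` and `codes` are hypothetical names for an $(M, k^*, d/M)$ codebook array and an $(N, M)$ code matrix, with toy sizes; because ADC is exact for the decoded (reconstructed) vectors, the sketch can be checked against explicit reconstruction:

```python
import numpy as np

# Asymmetric distance computation: one (M, k*) table of query-to-centroid
# sub-distances, then each database distance is M lookups and adds.
def adc_table(q, books):
    M, k_star, ds = books.shape
    sub_q = q.reshape(M, 1, ds)                # one sub-query per subspace
    return ((sub_q - books) ** 2).sum(-1)      # shape (M, k*)

def adc_distances(codes, table):
    # table[m, codes[:, m]] looks up every vector's m-th sub-distance;
    # summing over subspaces gives the approximate squared distances.
    return sum(table[m, codes[:, m]] for m in range(codes.shape[1]))

rng = np.random.default_rng(0)
books = rng.standard_normal((4, 16, 2))        # M=4, k*=16, d/M=2
codes = rng.integers(0, 16, size=(10, 4))
q = rng.standard_normal(8)
dists = adc_distances(codes, adc_table(q, books))
```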

6. IVF-PQ Combined

[Figure: Memory usage comparison across methods, 1M vectors × 768 dimensions]

6.1 Two-Stage Pipeline

IVF-PQ combines both ideas into a two-stage pipeline:

  1. Coarse quantization (IVF): Partition the dataset into $n_{\text{list}}$ Voronoi cells. At query time, select the $n_{\text{probe}}$ nearest cells.
  2. Fine quantization (PQ): Within the selected cells, vectors are stored as PQ codes. Distances are computed via ADC.

An important refinement: rather than quantizing the raw vectors, IVF-PQ typically stores and quantizes the residuals

$$r_i = x_i - c_{\sigma(x_i)}$$

where $c_{\sigma(x_i)}$ is the coarse centroid. Since residuals have smaller magnitude and lower variance than the original vectors, PQ achieves higher fidelity on them.

6.2 Query Procedure

Given query $q$:

  1. Compute $d(q, c_j)$ for all $n_{\text{list}}$ centroids. Select the top $n_{\text{probe}}$.
  2. For each selected centroid $c_j$, compute the query residual $r_q^{(j)} = q - c_j$.
  3. Build an ADC lookup table for $r_q^{(j)}$.
  4. Scan all PQ codes in cell $j$, computing approximate distances via ADC.
  5. Merge results across probed cells; return the top $k$.
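The five steps above can be sketched end to end on a toy index. All index-building names (`cents`, `lists`, `codes`, `books`) are hypothetical, and the codebooks are randomly initialized rather than trained, which affects fidelity but not the mechanics:

```python
import numpy as np

# IVF-PQ query sketch following steps 1-5 above, on a toy prebuilt index.
def ivfpq_query(q, cents, lists, codes, books, k, n_probe):
    M, k_star, ds = books.shape
    probed = np.argsort(((cents - q) ** 2).sum(axis=1))[:n_probe]  # step 1
    all_ids, all_d = [], []
    for j in probed:
        if len(lists[j]) == 0:
            continue
        r = (q - cents[j]).reshape(M, 1, ds)                       # step 2
        table = ((r - books) ** 2).sum(-1)                         # step 3: (M, k*)
        d = sum(table[m, codes[j][:, m]] for m in range(M))        # step 4: ADC
        all_ids.append(lists[j])
        all_d.append(d)
    ids, d = np.concatenate(all_ids), np.concatenate(all_d)        # step 5
    top = np.argsort(d)[:k]
    return ids[top], d[top]

# toy index: 400 vectors, 8 cells, M=4 subspaces, k*=32
rng = np.random.default_rng(3)
X = rng.standard_normal((400, 16))
n_list, M, k_star, ds = 8, 4, 32, 4
cents = X[rng.choice(400, n_list, replace=False)]
assign = np.argmin(((X[:, None] - cents[None]) ** 2).sum(-1), axis=1)
resid = X - cents[assign]                       # residuals, as in the text
books = rng.standard_normal((M, k_star, ds))
full_codes = np.stack(
    [np.argmin(((resid[:, m * ds:(m + 1) * ds][:, None] - books[m][None]) ** 2).sum(-1), axis=1)
     for m in range(M)], axis=1)
lists = [np.flatnonzero(assign == j) for j in range(n_list)]
codes = [full_codes[lists[j]] for j in range(n_list)]

q = rng.standard_normal(16)
ids, dists = ivfpq_query(q, cents, lists, codes, books, k=10, n_probe=8)
```

The returned distance for each id equals the exact distance from $q$ to that vector's reconstruction (coarse centroid plus decoded residual), which is what makes the sketch verifiable.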

6.3 Compute Analysis

Let $n_{\text{scan}} = n_{\text{probe}} \times N / n_{\text{list}}$ be the total number of vectors scanned. The computational cost breaks down as:

$$T_{\text{IVF-PQ}} = \underbrace{n_{\text{list}} \cdot d}_{\text{coarse search}} + \underbrace{n_{\text{probe}} \cdot k^* \cdot d}_{\text{table construction}} + \underbrace{n_{\text{scan}} \cdot M}_{\text{ADC lookups}}$$

6.4 Memory Analysis

The index stores:

  • Centroids: $n_{\text{list}} \times d \times 4$ bytes.
  • PQ codebooks: $M \times k^* \times (d/M) \times 4 = k^* \times d \times 4$ bytes.
  • PQ codes: $N \times M$ bytes.
  • Vector IDs: $N \times 8$ bytes (int64).

Total:

$$M_{\text{IVF-PQ}} = N(M + 8) + (n_{\text{list}} + k^*) \cdot d \cdot 4$$

6.5 Worked Example: Full Pipeline

Parameters: $N = 10^6$, $d = 768$, $M = 96$, $k^* = 256$, $n_{\text{list}} = 1024$, $n_{\text{probe}} = 16$.

Memory:

| Component | Size |
|-----------|------|
| PQ codes ($N \times M$) | $10^6 \times 96 = 96$ MB |
| Vector IDs ($N \times 8$) | 8 MB |
| Centroids ($1024 \times 768 \times 4$) | 3.0 MB |
| PQ codebooks ($256 \times 768 \times 4$) | 0.75 MB |
| Total | ~108 MB |

Compared to brute-force storage of 2.86 GB, this represents a 26.5× compression.

Compute per query:

| Stage | Operations |
|-------|------------|
| Coarse search | $1024 \times 768 = 786{,}432$ |
| Table construction | $16 \times 256 \times 768 = 3{,}145{,}728$ |
| ADC scan ($15{,}632$ vectors $\times$ 96 lookups) | $1{,}500{,}672$ |
| Total | ~$5.4 \times 10^6$ |

Compared to brute force ($7.7 \times 10^8$ FLOPs), the speedup is approximately 142×, while the memory footprint has been reduced by a factor of 26.5.
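The memory and compute figures can be reproduced directly from the two formulas. A quick arithmetic check; note the prose rounds $10^6/1024$ up to 977 vectors per cell, while the integer arithmetic here floors it, so the totals differ by a fraction of a percent:

```python
# Plugging this section's parameters into the memory and compute formulas.
N, d, M, k_star = 10**6, 768, 96, 256
n_list, n_probe = 1024, 16

mem_bytes = N * (M + 8) + (n_list + k_star) * d * 4     # ~108 MB
coarse = n_list * d                                     # coarse search ops
tables = n_probe * k_star * d                           # ADC table build ops
scan = (n_probe * N // n_list) * M                      # ADC lookup ops
total_ops = coarse + tables + scan                      # ~5.4e6
speedup = 7.7e8 / total_ops                             # vs brute force
```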

7. Practical Tradeoffs

[Figure: Recall vs latency tradeoff (IVF-PQ)]

Accuracy vs. Speed

Both HNSW and IVF-PQ trade exactness for speed, but they occupy different points in the design space:

  • HNSW achieves very high recall ($>0.99$) with moderate compute overhead. Its graph structure makes it well-suited for latency-critical applications where memory is abundant.
  • IVF-PQ sacrifices some recall (typically 0.85–0.95 depending on parameters) in exchange for dramatic memory savings. It is the method of choice when the dataset does not fit in RAM at full precision.

Memory vs. Recall

The relationship between PQ parameters and recall is governed by the quantization distortion. The expected squared distortion per subspace is bounded by the $k$-means quantization error:

$$\mathbb{E}\left[\|x^{(m)} - c^{(m)}_{q^{(m)}(x)}\|^2\right] \leq \sigma^2_{d/M}(k^*)$$

where $\sigma^2_{d/M}(k^*)$ is the optimal quantization error for a $(d/M)$-dimensional distribution with $k^*$ centroids. Increasing $M$ (more subspaces, each of lower dimension) or increasing $k^*$ reduces distortion but increases code size. In practice, a subspace dimension $d/M$ between 4 and 16 offers the best balance.

Hybrid Approaches

Modern systems often combine both methods. A common architecture uses IVF-PQ as a first-pass retrieval stage, followed by re-ranking with exact distances on the original (or HNSW-indexed) vectors. This two-stage pattern couples the memory efficiency of PQ with the precision of full-vector comparison:

  1. IVF-PQ retrieves the top $k' \gg k$ candidates (e.g., $k' = 10k$).
  2. Original vectors for these $k'$ candidates are fetched and re-ranked.
  3. The final top $k$ are returned.

The re-ranking step costs $O(k' \cdot d)$, which is negligible for moderate $k'$.
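The re-ranking stage itself is a one-liner over the fetched originals. A minimal sketch; the candidate id list stands in for a hypothetical first-pass output:

```python
import numpy as np

# Two-stage retrieval: a cheap first pass (e.g. IVF-PQ) supplies k'
# candidate ids; exact distances on the original vectors re-rank them.
def rerank(X, q, candidate_ids, k):
    d = ((X[candidate_ids] - q) ** 2).sum(axis=1)   # O(k'd) exact work
    return candidate_ids[np.argsort(d)[:k]]

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
q = X[42] + 0.01                      # query essentially on top of vector 42
final = rerank(X, q, np.array([3, 42, 77, 5]), k=2)
```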

Scaling Considerations

| Metric | Brute Force | HNSW | IVF-PQ |
|--------|-------------|------|--------|
| Query time | $O(Nd)$ | $O(\log N)$ | $O(\sqrt{N} \cdot M)$ |
| Memory | $4Nd$ | $\approx 4Nd$ | $N(M+8)$ |
| Index build | $O(1)$ | $O(N \log N)$ | $O(NdI)$* |
| Update support | Trivial | Incremental | Requires retraining |

*$I$ denotes the number of $k$-means iterations.

HNSW supports incremental insertion, making it suitable for streaming workloads. IVF-PQ requires training on a representative sample and is better suited for static or batch-updated corpora.

8. Conclusion

The transition from brute-force to approximate nearest neighbor search is not a single trick but a composition of orthogonal ideas. HNSW reduces the *number* of distance computations from $O(N)$ to $O(\log N)$ by organizing vectors in a multi-layer navigable graph. IVF reduces the search scope to a fraction of the dataset via Voronoi partitioning. Product Quantization reduces the *cost* of each distance computation and the *memory* per vector by compressing $d$-dimensional vectors into $M$-byte codes. IVF-PQ combines these last two into a system that can index a billion vectors in tens of gigabytes of RAM with query latencies in the low milliseconds.

Understanding the mathematical underpinnings — layer probability distributions, Voronoi cell geometry, subspace quantization distortion, and asymmetric distance computation — is essential for tuning these systems in production. The parameters $ef$, $M$, $n_{\text{list}}$, $n_{\text{probe}}$, and the PQ subspace configuration $(M, k^*)$ are not arbitrary knobs; they are direct consequences of the tradeoffs formalized above.

References

  1. Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 42(4), 824–836.
  2. Jégou, H., Douze, M., & Schmid, C. (2011). Product Quantization for Nearest Neighbor Search. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 33(1), 117–128.
  3. Johnson, J., Douze, M., & Jégou, H. (2019). Billion-Scale Similarity Search with GPUs. *IEEE Transactions on Big Data*, 7(3), 535–547.
  4. Babenko, A., & Lempitsky, V. (2014). The Inverted Multi-Index. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 37(6), 1247–1260.
  5. Ge, T., He, K., Ke, Q., & Sun, J. (2014). Optimized Product Quantization. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 36(4), 744–755.
  6. Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.-E., Lomeli, M., Hosseini, L., & Jégou, H. (2024). The Faiss Library. *arXiv preprint arXiv:2401.08281*.