Feb 28, 2026 · RAG & Retrieval

Why Is Vector Search So Fast? HNSW and IVF-PQ Explained With the Math

A walkthrough of approximate nearest neighbor search covering HNSW graphs, inverted file indexes, product quantization, and IVF-PQ with worked examples and memory analysis.


Abstract

A comparison of exact and approximate nearest neighbor search methods, with emphasis on the algorithmic and mathematical foundations of Hierarchical Navigable Small World graphs (HNSW) and the Inverted File Index with Product Quantization (IVF-PQ). This post formalizes the approximate nearest neighbor problem, derives complexity bounds, analyzes memory footprints, and provides worked numerical examples on datasets of practical scale. The goal is to make precise why these methods achieve sub-linear query times and orders-of-magnitude memory reductions relative to brute-force search.

1. The Nearest Neighbor Problem

Let $X = \{x_1, x_2, \dots, x_N\}$ be a dataset of $N$ vectors in $\mathbb{R}^d$, and let $d(\cdot, \cdot)$ be a distance function (typically Euclidean $L_2$ or inner product). Given a query vector $q \in \mathbb{R}^d$ and an integer $k \geq 1$, the exact $k$-nearest neighbor problem asks for the set

$$S^* = \underset{S \subseteq X,\; |S|=k}{\operatorname{arg\,min}} \; \max_{x \in S}\; d(q, x)$$

that is, the $k$ vectors in $X$ closest to $q$ under $d$.

The approximate variant relaxes this: return a set $\hat{S}$ such that, with high probability, for every $\hat{x} \in \hat{S}$ and corresponding true neighbor $x^* \in S^*$,

$$d(q, \hat{x}) \leq (1 + \epsilon)\, d(q, x^*)$$

for some approximation factor $\epsilon > 0$. In practice the community measures quality via recall@$k$, the fraction of true top-$k$ neighbors returned by the approximate algorithm.
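The recall@$k$ metric is a direct comparison of two id lists. A minimal Python sketch; the id values below are hypothetical, chosen for illustration:

```python
# Recall@k: the fraction of the true top-k neighbors that the
# approximate search returned.
def recall_at_k(true_ids, approx_ids, k):
    true_set = set(true_ids[:k])
    return sum(1 for i in approx_ids[:k] if i in true_set) / k

# 3 of the true top-4 recovered:
r = recall_at_k([1, 2, 3, 4], [1, 2, 9, 4], k=4)   # 0.75
```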

2. Brute Force Baseline

The naive approach computes $d(q, x_i)$ for every $i \in \{1, \dots, N\}$ and selects the $k$ smallest distances.

Time complexity. Each distance computation in $\mathbb{R}^d$ costs $O(d)$, so the total is

$$T_{\text{brute}} = O(Nd)$$

Memory. Storing the full dataset in float32 requires

$$M_{\text{brute}} = 4Nd \;\text{bytes}$$

Worked example. For $N = 10^6$ vectors of dimension $d = 768$:

  • Distance computations per query: $10^6 \times 768 \approx 7.7 \times 10^8$ FLOPs.
  • Memory: $4 \times 10^6 \times 768 = 3.072 \times 10^9$ bytes $\approx 2.86$ GB.

At a throughput of roughly $10^{10}$ FLOPs/s on a modern CPU core, this yields latencies on the order of 77 ms per query — tolerable for small workloads, but untenable at scale. The two families of methods described next reduce either the number of distance computations, the cost per computation, or both.
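The baseline fits in a few lines of NumPy. A minimal sketch with toy sizes; the function name and dimensions are illustrative, not from any particular library:

```python
import numpy as np

# Brute-force k-NN under squared L2: O(Nd) work per query.
def brute_force_knn(X, q, k):
    # ||x - q||^2 = ||x||^2 - 2 x.q + ||q||^2; the ||q||^2 term is the
    # same for every x, so it can be dropped for ranking purposes.
    d2 = (X * X).sum(axis=1) - 2.0 * (X @ q)
    idx = np.argpartition(d2, k)[:k]           # unordered top-k in O(N)
    return idx[np.argsort(d2[idx])]            # sort only the k winners

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 32))
q = rng.standard_normal(32)
top = brute_force_knn(X, q, 5)
```

`argpartition` avoids a full $O(N \log N)$ sort: only the $k$ winners are ordered.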

3. HNSW (Hierarchical Navigable Small World)

[Figure: HNSW hierarchical layer structure]

3.1 Navigable Small-World Graphs

A navigable small-world (NSW) graph places every vector $x_i$ as a node and connects each to a set of neighbors such that greedy routing converges quickly. The key structural property is that the graph possesses both short-range links (connecting nearby vectors) and long-range links (enabling logarithmic-hop traversal).

HNSW, introduced by Malkov and Yashunin (2018), layers multiple NSW graphs in a hierarchy. Layers are numbered $0, 1, \dots, L$. Layer 0 contains all $N$ nodes; higher layers contain exponentially fewer nodes.

3.2 Layer Assignment

When inserting element $x_i$, its maximum layer $\ell_i$ is drawn as

$$\ell_i = \lfloor -\ln(\text{uniform}(0,1)) \cdot m_L \rfloor$$

where $m_L$ is a normalization constant, typically set to $m_L = 1/\ln(M)$, with $M$ being the maximum number of connections per node. The probability that a node appears at layer $\ell$ or above is

$$P(\ell_i \geq \ell) = \exp\!\left(-\frac{\ell}{m_L}\right)$$

Setting $m_L = 1/\ln(M)$ gives $P(\ell_i \geq \ell) = M^{-\ell}$, so the expected number of nodes at layer $\ell$ is

$$\mathbb{E}[|\text{layer } \ell|] = N \cdot M^{-\ell}$$

so the expected number of layers is

$$L = O\!\left(\frac{\ln N}{\ln M}\right) = O(\log_M N)$$
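The layer-assignment rule is easy to sanity-check numerically. A Monte Carlo sketch with $M = 16$, confirming that a node reaches layer $\geq 1$ with probability $\approx 1/M$:

```python
import numpy as np

# HNSW layer assignment: l = floor(-ln(U) * mL) with mL = 1/ln(M).
# With M = 16, a node reaches layer >= 1 with probability 1/M = 0.0625.
M = 16
mL = 1.0 / np.log(M)
rng = np.random.default_rng(42)
levels = np.floor(-np.log(rng.random(200_000)) * mL).astype(int)

frac_above_0 = float(np.mean(levels >= 1))   # should be close to 1/16
max_level = int(levels.max())                # grows like log_M N
```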

3.3 Greedy Search

Search begins at the top layer from a fixed entry point. At each layer, a greedy procedure expands a beam (priority queue) of width $ef$ — called efSearch — by iterating over each candidate's neighbors and keeping the closest $ef$ vectors found so far. When no improvement is possible at layer $\ell$, the current best candidate descends to layer $\ell - 1$. (In the reference algorithm, the upper layers are traversed with a beam of width 1; the full width $ef$ is used only at layer 0.) At layer 0, the procedure terminates and returns the top $k$ vectors from the beam.
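The per-layer routine can be sketched as a beam search over an adjacency list. A simplified single-layer version; the path graph and 1-D points below are toy values for illustration:

```python
import heapq
import numpy as np

# Greedy beam search over one NSW layer: keep the `ef` closest vectors
# seen so far, always expanding the closest unexplored candidate.
def search_layer(vectors, graph, q, entry, ef):
    dist = lambda i: float(((vectors[i] - q) ** 2).sum())
    visited = {entry}
    frontier = [(dist(entry), entry)]      # min-heap of unexplored nodes
    beam = [(-dist(entry), entry)]         # max-heap (negated), size <= ef
    while frontier:
        d_c, c = heapq.heappop(frontier)
        if d_c > -beam[0][0]:
            break                          # frontier worse than beam's worst
        for nb in graph[c]:
            if nb not in visited:
                visited.add(nb)
                d_n = dist(nb)
                if len(beam) < ef or d_n < -beam[0][0]:
                    heapq.heappush(frontier, (d_n, nb))
                    heapq.heappush(beam, (-d_n, nb))
                    if len(beam) > ef:
                        heapq.heappop(beam)    # evict current worst
    return sorted((-d, i) for d, i in beam)    # (distance, id), closest first

# toy demo: 1-D points 0..9 chained in a path graph, query near 7.3
vecs = np.arange(10, dtype=float).reshape(10, 1)
g = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
found = search_layer(vecs, g, np.array([7.3]), entry=0, ef=3)
```

Starting from node 0, the routine walks the chain toward the query and ends up holding the three closest points (7, 8, 6).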

3.4 Complexity Analysis

At each layer, greedy search visits $O(ef \cdot M)$ nodes (expanding $ef$ candidates, each with $M$ neighbors). Across $L = O(\log_M N)$ layers, the total number of distance computations is

$$T_{\text{HNSW}} = O\!\left(ef \cdot M \cdot \log_M N\right) = O\!\left(ef \cdot M \cdot \frac{\ln N}{\ln M}\right)$$

For fixed $ef$ and $M$, this is $O(\log N)$ — exponentially better than brute force.

3.5 Recall–Latency Tradeoff

The parameter $ef$ directly governs the recall–latency tradeoff. Larger $ef$ means more candidates are evaluated, increasing recall at the cost of more distance computations. Empirically, on 1M-scale datasets with $d = 768$ and $M = 16$:

| $ef$ | Recall@10 | Relative latency |
|------|-----------|------------------|
| 16   | ~0.85     | 1×               |
| 64   | ~0.95     | 3.5×             |
| 256  | ~0.99     | 12×              |

Even at $ef = 256$, total distance computations are on the order of $256 \times 16 \times \log_{16}(10^6) \approx 256 \times 16 \times 5 \approx 20{,}480$, roughly 50× fewer than the $10^6$ required by brute force.

3.6 Memory

HNSW stores the full-precision vectors plus the graph. Graph overhead per node is approximately $M \cdot L_{\text{avg}} \cdot 4$ bytes (storing neighbor IDs as int32). With $M=16$ and $L_{\text{avg}} \approx 1.3$ layers per node on average, this is roughly 83 bytes per vector — modest compared to the vector storage itself. Total memory for the worked example:

$$M_{\text{HNSW}} \approx 4Nd + N \cdot M \cdot L_{\text{avg}} \cdot 4 \approx 2.86\;\text{GB} + 83\;\text{MB} \approx 2.94\;\text{GB}$$

HNSW reduces *compute* dramatically but provides little *memory* savings over brute force. This motivates quantization-based approaches.

4. IVF (Inverted File Index)

4.1 Voronoi Partitioning

The Inverted File Index partitions the dataset $X$ into $n_{\text{list}}$ clusters via $k$-means (or a similar algorithm). Let $\{c_1, c_2, \dots, c_{n_{\text{list}}}\}$ be the centroids. Each vector $x_i$ is assigned to its nearest centroid:

$$\sigma(x_i) = \underset{j \in \{1,\dots,n_{\text{list}}\}}{\operatorname{arg\,min}}\; d(x_i, c_j)$$

This induces a Voronoi partition of $\mathbb{R}^d$. Each cell $V_j = \{x_i : \sigma(x_i) = j\}$ stores an inverted list of its members.

4.2 Query Procedure

At query time, the algorithm first computes $d(q, c_j)$ for all $j$ and selects the $n_{\text{probe}}$ closest centroids. It then performs exhaustive search only within those $n_{\text{probe}}$ inverted lists:

$$T_{\text{IVF}} = O\!\left(n_{\text{list}} \cdot d + n_{\text{probe}} \cdot \frac{N}{n_{\text{list}}} \cdot d\right)$$

The first term is the centroid comparison cost; the second is the exhaustive scan within the selected cells. When $n_{\text{list}} = \sqrt{N}$, the second term becomes $O(n_{\text{probe}} \cdot \sqrt{N} \cdot d)$.
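The partition-then-probe procedure can be sketched in NumPy. The k-means trainer here is a deliberately plain stand-in for a tuned implementation, and all sizes are toy values:

```python
import numpy as np

# IVF sketch: learn centroids, build inverted lists, then probe only
# the n_probe cells nearest the query.
def kmeans(X, n_list, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), n_list, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(n_list):
            members = X[assign == j]
            if len(members):
                C[j] = members.mean(axis=0)   # recenter non-empty cells
    return C

def ivf_search(X, C, inv_lists, q, k, n_probe):
    cell_d = ((C - q) ** 2).sum(axis=1)                 # coarse: n_list * d work
    probed = np.argsort(cell_d)[:n_probe]
    ids = np.concatenate([inv_lists[j] for j in probed])
    d = ((X[ids] - q) ** 2).sum(axis=1)                 # fine: probed cells only
    return ids[np.argsort(d)[:k]]

rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 16))
C = kmeans(X, n_list=32)
assign = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
inv_lists = [np.flatnonzero(assign == j) for j in range(32)]
```

With `n_probe` equal to `n_list` the search degenerates to exact brute force, which makes the sketch easy to validate.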

4.3 Recall as a Function of $n_{\text{probe}}/n_{\text{list}}$

Under the simplifying assumption that true nearest neighbors are uniformly distributed across Voronoi cells, the probability that a given true neighbor falls within the probed cells is approximately

$$P(\text{hit}) \approx \frac{n_{\text{probe}}}{n_{\text{list}}}$$

For $k$ independent neighbors, the expected recall is

$$\mathbb{E}[\text{recall@}k] \approx \frac{n_{\text{probe}}}{n_{\text{list}}}$$

In practice, true neighbors are not uniformly distributed: they concentrate in the cells closest to the query, so observed recall is substantially higher than this pessimistic estimate. Typical configurations use $n_{\text{list}} = 1024$ to $4096$ and $n_{\text{probe}} = 8$ to $64$, achieving recall@10 above 0.90 while scanning fewer than 5% of the dataset.

4.4 Worked Example

With $N = 10^6$, $d = 768$, $n_{\text{list}} = 1024$, $n_{\text{probe}} = 16$:

  • Vectors per cell: $10^6 / 1024 \approx 977$.
  • Vectors scanned: $16 \times 977 = 15{,}632$.
  • Distance computations: $1024 \times 768 + 15{,}632 \times 768 \approx 12.8 \times 10^6$.
  • Speedup vs. brute force: $7.7 \times 10^8 / 1.28 \times 10^7 \approx 60\times$.

Memory remains $O(Nd)$ since the raw vectors are still stored.

5. Product Quantization

5.1 Subspace Decomposition

Product Quantization (PQ), introduced by Jégou et al. (2011), compresses each $d$-dimensional vector into a short code. The vector space $\mathbb{R}^d$ is decomposed into $M$ disjoint subspaces of dimension $d/M$ each:

$$x = [x^{(1)}, x^{(2)}, \dots, x^{(M)}], \quad x^{(m)} \in \mathbb{R}^{d/M}$$

5.2 Codebook Learning

For each subspace $m$, a codebook $C^{(m)} = \{c^{(m)}_1, \dots, c^{(m)}_{k^*}\}$ of $k^*$ centroids is learned via $k$-means on the projected sub-vectors $\{x_i^{(m)}\}_{i=1}^N$. Each sub-vector is then replaced by its nearest centroid index:

$$q^{(m)}(x) = \underset{j \in \{1,\dots,k^*\}}{\operatorname{arg\,min}}\; \|x^{(m)} - c^{(m)}_j\|^2$$

The PQ code for vector $x$ is the tuple of indices $(q^{(1)}(x), \dots, q^{(M)}(x))$.
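Codebook learning and encoding can be sketched as follows; the mini k-means trainer and the toy sizes are illustrative stand-ins for a production trainer:

```python
import numpy as np

# Product quantization sketch: split d dimensions into M subspaces,
# learn a k*-entry codebook per subspace, then encode each sub-vector
# as a 1-byte centroid index (k* <= 256).
def train_pq(X, M, k_star=256, iters=5, seed=0):
    N, d = X.shape
    ds = d // M                                     # subspace dimension
    rng = np.random.default_rng(seed)
    books = np.empty((M, k_star, ds))
    for m in range(M):
        sub = X[:, m * ds:(m + 1) * ds]
        C = sub[rng.choice(N, k_star, replace=False)]
        for _ in range(iters):
            a = np.argmin(((sub[:, None] - C[None]) ** 2).sum(-1), axis=1)
            for j in range(k_star):
                pts = sub[a == j]
                if len(pts):
                    C[j] = pts.mean(axis=0)
        books[m] = C
    return books

def pq_encode(X, books):
    M, k_star, ds = books.shape
    codes = np.empty((len(X), M), dtype=np.uint8)
    for m in range(M):
        sub = X[:, m * ds:(m + 1) * ds]
        codes[:, m] = np.argmin(((sub[:, None] - books[m][None]) ** 2).sum(-1), axis=1)
    return codes

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 32))
books = train_pq(X, M=8, k_star=256, iters=3)
codes = pq_encode(X, books)        # shape (1000, 8): 8 bytes per vector
```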

5.3 Memory Per Vector

Each index requires $\lceil \log_2 k^* \rceil$ bits. With the standard choice $k^* = 256$, each index is exactly 1 byte. The compressed representation of a single vector requires

$$M_{\text{PQ}} = M \cdot \frac{\log_2 k^*}{8} = M \;\text{bytes} \quad (\text{when } k^* = 256)$$

5.4 Worked Example: Memory Savings

For $N = 10^6$, $d = 768$, $M = 96$ (subspace dimension 8), $k^* = 256$:

  • Raw storage: $4 \times 10^6 \times 768 \approx 3.07 \times 10^9$ bytes $\approx 2.86$ GB.
  • PQ storage: $96 \times 10^6 = 96$ MB.
  • Compression ratio: 32× (3072 bytes per vector down to 96).

5.5 Asymmetric Distance Computation (ADC)

At query time, the distance from query $q$ to a database vector $x$ (represented by its PQ code) is approximated via asymmetric distance computation: the query stays in full precision while only the database side is quantized. The key insight: precompute a lookup table of sub-vector distances.

For each subspace $m$ and each centroid $j$, compute

$$D^{(m)}_j = \|q^{(m)} - c^{(m)}_j\|^2$$

This produces $M \times k^*$ entries — a table of shape $(M, 256)$. The approximate squared distance from $q$ to $x$ is then

$$\hat{d}(q, x)^2 = \sum_{m=1}^{M} D^{(m)}_{q^{(m)}(x)}$$

Each distance requires only $M$ table lookups and additions — no floating-point multiplications.

Lookup table construction cost: $M \times k^* \times (d/M) = k^* \times d$ FLOPs — independent of $N$.

Per-vector distance cost: $M$ lookups + $M-1$ additions $= O(M)$.

For the worked example ($M = 96$):

  • Table construction: $256 \times 768 = 196{,}608$ FLOPs (one-time per query).
  • Per-vector distance: 96 lookups.
  • Total for scanning all 1M vectors: $96 \times 10^6 = 9.6 \times 10^7$ operations.
  • Compare to brute force: $7.7 \times 10^8$ FLOPs — an 8× reduction in compute, on top of the roughly 30× memory reduction.
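The table-then-lookup procedure is a few lines of NumPy. Here `books` and `codes` are hypothetical names for an $(M, k^*, d/M)$ codebook array and an $(N, M)$ code matrix, with toy sizes; because ADC is exact for the decoded (reconstructed) vectors, the sketch can be checked against explicit reconstruction:

```python
import numpy as np

# Asymmetric distance computation: one (M, k*) table of query-to-centroid
# sub-distances, then each database distance is M lookups and adds.
def adc_table(q, books):
    M, k_star, ds = books.shape
    sub_q = q.reshape(M, 1, ds)                # one sub-query per subspace
    return ((sub_q - books) ** 2).sum(-1)      # shape (M, k*)

def adc_distances(codes, table):
    # table[m, codes[:, m]] looks up every vector's m-th sub-distance;
    # summing over subspaces gives the approximate squared distances.
    return sum(table[m, codes[:, m]] for m in range(codes.shape[1]))

rng = np.random.default_rng(0)
books = rng.standard_normal((4, 16, 2))        # M=4, k*=16, d/M=2
codes = rng.integers(0, 16, size=(10, 4))
q = rng.standard_normal(8)
dists = adc_distances(codes, adc_table(q, books))
```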

6. IVF-PQ Combined

[Figure: Memory usage comparison across methods, 1M vectors × 768 dimensions]

6.1 Two-Stage Pipeline

IVF-PQ combines both ideas into a two-stage pipeline:

  1. Coarse quantization (IVF): Partition the dataset into $n_{\text{list}}$ Voronoi cells. At query time, select the $n_{\text{probe}}$ nearest cells.
  2. Fine quantization (PQ): Within the selected cells, vectors are stored as PQ codes. Distances are computed via ADC.

An important refinement: rather than quantizing the raw vectors, IVF-PQ typically stores and quantizes the residuals

$$r_i = x_i - c_{\sigma(x_i)}$$

where $c_{\sigma(x_i)}$ is the coarse centroid. Since residuals have smaller magnitude and lower variance than the original vectors, PQ achieves higher fidelity on them.

6.2 Query Procedure

Given query $q$:

  1. Compute $d(q, c_j)$ for all $n_{\text{list}}$ centroids. Select the top $n_{\text{probe}}$.
  2. For each selected centroid $c_j$, compute the query residual $r_q^{(j)} = q - c_j$.
  3. Build an ADC lookup table for $r_q^{(j)}$.
  4. Scan all PQ codes in cell $j$, computing approximate distances via ADC.
  5. Merge results across probed cells; return the top $k$.
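The five steps above can be sketched end to end on a toy index. All index-building names (`cents`, `lists`, `codes`, `books`) are hypothetical, and the codebooks are randomly initialized rather than trained, which affects fidelity but not the mechanics:

```python
import numpy as np

# IVF-PQ query sketch following steps 1-5 above, on a toy prebuilt index.
def ivfpq_query(q, cents, lists, codes, books, k, n_probe):
    M, k_star, ds = books.shape
    probed = np.argsort(((cents - q) ** 2).sum(axis=1))[:n_probe]  # step 1
    all_ids, all_d = [], []
    for j in probed:
        if len(lists[j]) == 0:
            continue
        r = (q - cents[j]).reshape(M, 1, ds)                       # step 2
        table = ((r - books) ** 2).sum(-1)                         # step 3: (M, k*)
        d = sum(table[m, codes[j][:, m]] for m in range(M))        # step 4: ADC
        all_ids.append(lists[j])
        all_d.append(d)
    ids, d = np.concatenate(all_ids), np.concatenate(all_d)        # step 5
    top = np.argsort(d)[:k]
    return ids[top], d[top]

# toy index: 400 vectors, 8 cells, M=4 subspaces, k*=32
rng = np.random.default_rng(3)
X = rng.standard_normal((400, 16))
n_list, M, k_star, ds = 8, 4, 32, 4
cents = X[rng.choice(400, n_list, replace=False)]
assign = np.argmin(((X[:, None] - cents[None]) ** 2).sum(-1), axis=1)
resid = X - cents[assign]                       # residuals, as in the text
books = rng.standard_normal((M, k_star, ds))
full_codes = np.stack(
    [np.argmin(((resid[:, m * ds:(m + 1) * ds][:, None] - books[m][None]) ** 2).sum(-1), axis=1)
     for m in range(M)], axis=1)
lists = [np.flatnonzero(assign == j) for j in range(n_list)]
codes = [full_codes[lists[j]] for j in range(n_list)]

q = rng.standard_normal(16)
ids, dists = ivfpq_query(q, cents, lists, codes, books, k=10, n_probe=8)
```

The returned distance for each id equals the exact distance from $q$ to that vector's reconstruction (coarse centroid plus decoded residual), which is what makes the sketch verifiable.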

6.3 Compute Analysis

Let $n_{\text{scan}} = n_{\text{probe}} \times N / n_{\text{list}}$ be the total number of vectors scanned. The computational cost breaks down as:

$$T_{\text{IVF-PQ}} = \underbrace{n_{\text{list}} \cdot d}_{\text{coarse search}} + \underbrace{n_{\text{probe}} \cdot k^* \cdot d}_{\text{table construction}} + \underbrace{n_{\text{scan}} \cdot M}_{\text{ADC lookups}}$$

6.4 Memory Analysis

The index stores:

  • Centroids: $n_{\text{list}} \times d \times 4$ bytes.
  • PQ codebooks: $M \times k^* \times (d/M) \times 4 = k^* \times d \times 4$ bytes.
  • PQ codes: $N \times M$ bytes.
  • Vector IDs: $N \times 8$ bytes (int64).

Total:

$$M_{\text{IVF-PQ}} = N(M + 8) + (n_{\text{list}} + k^*) \cdot d \cdot 4$$

6.5 Worked Example: Full Pipeline

Parameters: $N = 10^6$, $d = 768$, $M = 96$, $k^* = 256$, $n_{\text{list}} = 1024$, $n_{\text{probe}} = 16$.

Memory:

| Component | Size |
|-----------|------|
| PQ codes ($N \times M$) | $10^6 \times 96 = 96$ MB |
| Vector IDs ($N \times 8$) | 8 MB |
| Centroids ($1024 \times 768 \times 4$) | 3.0 MB |
| PQ codebooks ($256 \times 768 \times 4$) | 0.75 MB |
| Total | ~108 MB |

Compared to brute-force storage of 2.86 GB, this represents a 26.5× compression.

Compute per query:

| Stage | Operations |
|-------|------------|
| Coarse search | $1024 \times 768 = 786{,}432$ |
| Table construction | $16 \times 256 \times 768 = 3{,}145{,}728$ |
| ADC scan ($15{,}632$ vectors $\times$ 96 lookups) | $1{,}500{,}672$ |
| Total | ~$5.4 \times 10^6$ |

Compared to brute force ($7.7 \times 10^8$ FLOPs), the speedup is approximately 142×, while the memory footprint has been reduced by a factor of 26.5.
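The memory and compute figures can be reproduced directly from the two formulas. A quick arithmetic check; note the prose rounds $10^6/1024$ up to 977 vectors per cell, while the integer arithmetic here floors it, so the totals differ by a fraction of a percent:

```python
# Plugging this section's parameters into the memory and compute formulas.
N, d, M, k_star = 10**6, 768, 96, 256
n_list, n_probe = 1024, 16

mem_bytes = N * (M + 8) + (n_list + k_star) * d * 4     # ~108 MB
coarse = n_list * d                                     # coarse search ops
tables = n_probe * k_star * d                           # ADC table build ops
scan = (n_probe * N // n_list) * M                      # ADC lookup ops
total_ops = coarse + tables + scan                      # ~5.4e6
speedup = 7.7e8 / total_ops                             # vs brute force
```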

7. Practical Tradeoffs

[Figure: Recall vs latency tradeoff (IVF-PQ)]

Accuracy vs. Speed

Both HNSW and IVF-PQ trade exactness for speed, but they occupy different points in the design space:

  • HNSW achieves very high recall ($>0.99$) with moderate compute overhead. Its graph structure makes it well-suited for latency-critical applications where memory is abundant.
  • IVF-PQ sacrifices some recall (typically 0.85–0.95 depending on parameters) in exchange for dramatic memory savings. It is the method of choice when the dataset does not fit in RAM at full precision.

Memory vs. Recall

The relationship between PQ parameters and recall is governed by the quantization distortion. The expected squared distortion per subspace is bounded by the $k$-means quantization error:

$$\mathbb{E}\left[\|x^{(m)} - c^{(m)}_{q^{(m)}(x)}\|^2\right] \leq \sigma^2_{d/M}(k^*)$$

where $\sigma^2_{d/M}(k^*)$ is the optimal quantization error for a $(d/M)$-dimensional distribution with $k^*$ centroids. Increasing $M$ (more subspaces, each of lower dimension) or increasing $k^*$ reduces distortion but increases code size. In practice, a subspace dimension $d/M$ between 4 and 16 offers the best balance.

Hybrid Approaches

Modern systems often combine both methods. A common architecture uses IVF-PQ as a first-pass retrieval stage, followed by re-ranking with exact distances on the original (or HNSW-indexed) vectors. This two-stage pattern couples the memory efficiency of PQ with the precision of full-vector comparison:

  1. IVF-PQ retrieves the top $k' \gg k$ candidates (e.g., $k' = 10k$).
  2. Original vectors for these $k'$ candidates are fetched and re-ranked.
  3. The final top $k$ are returned.

The re-ranking step costs $O(k' \cdot d)$, which is negligible for moderate $k'$.
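The re-ranking stage itself is a one-liner over the fetched originals. A minimal sketch; the candidate id list stands in for a hypothetical first-pass output:

```python
import numpy as np

# Two-stage retrieval: a cheap first pass (e.g. IVF-PQ) supplies k'
# candidate ids; exact distances on the original vectors re-rank them.
def rerank(X, q, candidate_ids, k):
    d = ((X[candidate_ids] - q) ** 2).sum(axis=1)   # O(k'd) exact work
    return candidate_ids[np.argsort(d)[:k]]

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
q = X[42] + 0.01                      # query essentially on top of vector 42
final = rerank(X, q, np.array([3, 42, 77, 5]), k=2)
```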

Scaling Considerations

| Metric | Brute Force | HNSW | IVF-PQ |
|--------|-------------|------|--------|
| Query time | $O(Nd)$ | $O(\log N)$ | $O(\sqrt{N} \cdot M)$ |
| Memory | $4Nd$ | $\approx 4Nd$ | $N(M+8)$ |
| Index build | $O(1)$ | $O(N \log N)$ | $O(NdI)$* |
| Update support | Trivial | Incremental | Requires retraining |

*$I$ denotes the number of $k$-means iterations.

HNSW supports incremental insertion, making it suitable for streaming workloads. IVF-PQ requires training on a representative sample and is better suited for static or batch-updated corpora.

8. Conclusion

The transition from brute-force to approximate nearest neighbor search is not a single trick but a composition of orthogonal ideas. HNSW reduces the *number* of distance computations from $O(N)$ to $O(\log N)$ by organizing vectors in a multi-layer navigable graph. IVF reduces the search scope to a fraction of the dataset via Voronoi partitioning. Product Quantization reduces the *cost* of each distance computation and the *memory* per vector by compressing $d$-dimensional vectors into $M$-byte codes. IVF-PQ combines these last two into a system that can index a billion vectors in tens of gigabytes of RAM with query latencies in the low milliseconds.

Understanding the mathematical underpinnings — layer probability distributions, Voronoi cell geometry, subspace quantization distortion, and asymmetric distance computation — is essential for tuning these systems in production. The parameters $ef$, $M$, $n_{\text{list}}$, $n_{\text{probe}}$, and the PQ subspace configuration $(M, k^*)$ are not arbitrary knobs; they are direct consequences of the tradeoffs formalized above.

References

  1. Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 42(4), 824–836.
  2. Jégou, H., Douze, M., & Schmid, C. (2011). Product Quantization for Nearest Neighbor Search. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 33(1), 117–128.
  3. Johnson, J., Douze, M., & Jégou, H. (2019). Billion-Scale Similarity Search with GPUs. *IEEE Transactions on Big Data*, 7(3), 535–547.
  4. Babenko, A., & Lempitsky, V. (2014). The Inverted Multi-Index. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 37(6), 1247–1260.
  5. Ge, T., He, K., Ke, Q., & Sun, J. (2014). Optimized Product Quantization. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 36(4), 744–755.
  6. Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.-E., Lomeli, M., Hosseini, L., & Jégou, H. (2024). The Faiss Library. *arXiv preprint arXiv:2401.08281*.