8/30/2025

On the Theoretical Limitations of Embedding-Based Retrieval

9 tweets
2 min read

Thrummarise (@summarizer)

Embedding-based retrieval has advanced from sparse methods to dense neural embeddings and is now asked to handle increasingly complex queries and relevance definitions. However, there are theoretical limits, set by the embedding dimension, on which combinations of relevant documents these embeddings can represent.

The authors connect learning theory and communication complexity to show that the number of top-k document subsets representable by embeddings is limited by vector dimension. Even with k=2, some document combinations cannot be encoded, regardless of training or model size.
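
As a toy illustration of that counting argument (a numeric sanity check, not the paper's proof): with 1-dimensional embeddings and dot-product scoring, every query just sorts the documents along one axis, so no placement of four documents can reach more than a couple of the C(4,2) = 6 possible top-2 subsets.

```python
import itertools
import numpy as np

# Toy check: with dim=1, a query's top-2 is always the two largest or two
# smallest document values, so at most 2 of the 6 possible subsets are reachable.
rng = np.random.default_rng(0)
n_docs, k, dim = 4, 2, 1
all_subsets = list(itertools.combinations(range(n_docs), k))

best = 0
for _ in range(200):                      # try many document placements
    docs = rng.normal(size=(n_docs, dim))
    reachable = set()
    for _ in range(200):                  # try many queries per placement
        q = rng.normal(size=dim)
        top_k = tuple(sorted(np.argsort(docs @ q)[-k:].tolist()))
        reachable.add(top_k)
    best = max(best, len(reachable))

print(f"best placement reaches {best} of {len(all_subsets)} top-{k} subsets")
```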

To validate this empirically, the authors optimize embeddings directly on the test labels ("free embeddings"), revealing for each embedding dimension a critical number of documents beyond which not all top-k combinations can be represented. This critical point grows polynomially with dimension, highlighting scalability challenges.
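
A rough sketch of what such a free-embedding experiment can look like (the loss, optimizer, and step counts here are illustrative assumptions, not the paper's exact setup): the vectors are free parameters trained on the test qrels themselves, so any failure can only come from the dimension, not from language understanding.

```python
import itertools
import torch

def free_embedding_test(qrels, dim, steps=2000, lr=0.1):
    """qrels: (n_queries, n_docs) 0/1 tensor of relevant documents.
    Optimizes query/doc vectors directly on these labels and reports whether
    every query can rank all of its relevant docs above all irrelevant ones."""
    n_q, n_d = qrels.shape
    q = torch.randn(n_q, dim, requires_grad=True)
    d = torch.randn(n_d, dim, requires_grad=True)
    opt = torch.optim.Adam([q, d], lr=lr)
    for _ in range(steps):
        scores = q @ d.T                                        # dot-product scores
        worst_pos = scores.masked_fill(qrels == 0, float("inf")).min(dim=1).values
        best_neg = scores.masked_fill(qrels == 1, float("-inf")).max(dim=1).values
        loss = torch.relu(1.0 - (worst_pos - best_neg)).mean()  # margin loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        scores = q @ d.T
        worst_pos = scores.masked_fill(qrels == 0, float("inf")).min(dim=1).values
        best_neg = scores.masked_fill(qrels == 1, float("-inf")).max(dim=1).values
        return bool((worst_pos > best_neg).all())               # True = fully representable

# All top-2 combinations of 8 documents, tested at dim=4:
pairs = list(itertools.combinations(range(8), 2))
qrels = torch.zeros(len(pairs), 8)
for i, (a, b) in enumerate(pairs):
    qrels[i, a] = qrels[i, b] = 1.0
print(free_embedding_test(qrels, dim=4))
```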

The authors introduce LIMIT, a simple yet challenging dataset that instantiates all top-k combinations for a small set of documents and queries. Despite the task's simplicity, state-of-the-art embedding models perform poorly, confirming the theoretical limitations in a realistic setting.
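
Schematically, the construction looks something like the sketch below (attribute names and phrasing are placeholders, not the actual LIMIT text): each query asks about one attribute, each document describes a person, and attributes are assigned so that every k-combination of documents is the answer to exactly one query.

```python
import itertools

people = ["Jon", "Ana", "Lee", "Sam"]                      # toy document pool
k = 2
combos = list(itertools.combinations(range(len(people)), k))

queries, qrels = [], {}
likes = {p: [] for p in people}
for i, combo in enumerate(combos):
    attr = f"attribute_{i}"                                # placeholder attribute
    query = f"Who likes {attr}?"
    queries.append(query)
    qrels[query] = [people[j] for j in combo]              # exactly k relevant docs
    for j in combo:
        likes[people[j]].append(attr)

documents = [f"{p} likes {', '.join(a)}." for p, a in likes.items()]
```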

Embedding dimension strongly affects performance on LIMIT: larger dimensions improve recall, but even the biggest models struggle. Sparse lexical models like BM25 excel because their effective dimensionality is the vocabulary size, while multi-vector models show promise but are not yet widely applied to reasoning tasks.
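
One way to see why lexical models sidestep the bound (an illustrative sketch, not the paper's implementation): a sparse bag-of-words index effectively places documents in a space whose dimension is the vocabulary size, orders of magnitude larger than typical dense embedding dimensions.

```python
import numpy as np

docs = ["jon likes quokkas and apples", "ana likes llamas and quokkas"]
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# each document becomes a |vocab|-dimensional sparse vector of term counts
vectors = np.zeros((len(docs), len(vocab)))
for row, d in enumerate(docs):
    for w in d.split():
        vectors[row, index[w]] += 1

print(len(vocab))   # effective dimension = vocabulary size (tens of thousands in practice)
```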

Training embedding models on LIMIT's training set yields minimal gains, indicating that the poor performance stems not from domain shift but from intrinsic model limitations. Models can overfit the test set directly, but that does not reflect realistic generalization, reinforcing that the constraints are fundamental.

Different query relevance patterns affect difficulty. Dense qrel matrices with many document combinations are significantly harder for embedding models than sparser patterns, confirming that complexity of top-k combinations drives retrieval challenges.
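
For intuition, here is one way to contrast a dense qrel pattern with a sparser one (the construction is illustrative; the paper's exact pattern variants may differ): the dense matrix instantiates every k-combination of documents, while the disjoint one uses non-overlapping relevant sets.

```python
import itertools
import numpy as np

def dense_qrels(n_docs, k):
    """Every k-combination of documents is the relevant set of some query."""
    combos = list(itertools.combinations(range(n_docs), k))
    m = np.zeros((len(combos), n_docs), dtype=int)
    for row, combo in enumerate(combos):
        m[row, list(combo)] = 1
    return m

def disjoint_qrels(n_docs, k):
    """Sparser pattern: queries get non-overlapping blocks of k documents."""
    starts = range(0, n_docs - k + 1, k)
    m = np.zeros((len(starts), n_docs), dtype=int)
    for row, s in enumerate(starts):
        m[row, s:s + k] = 1
    return m

print(dense_qrels(8, 2).shape)     # (28, 8): many distinct top-2 combinations
print(disjoint_qrels(8, 2).shape)  # (4, 8): few, non-overlapping combinations
```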

The authors discuss alternatives to single-vector embeddings: cross-encoders can perfectly solve LIMIT but are computationally expensive; multi-vector models improve expressiveness; sparse models handle many combinations but struggle with instruction-following tasks.
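
For context on the multi-vector option, a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring, contrasted with collapsing each side to a single pooled vector (the pooling choice here is just an example):

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: each query token vector takes its best-matching
    document token vector; the per-token maxima are summed."""
    sims = query_vecs @ doc_vecs.T        # (n_query_tokens, n_doc_tokens)
    return sims.max(axis=1).sum()

def single_vector_score(query_vecs, doc_vecs):
    """Single-vector baseline: pool each side first (mean pooling as an example),
    which is exactly the representational bottleneck the paper analyzes."""
    return query_vecs.mean(axis=0) @ doc_vecs.mean(axis=0)

q = np.random.randn(5, 64)    # 5 query token vectors, 64-dim
d = np.random.randn(40, 64)   # 40 document token vectors
print(maxsim_score(q, d), single_vector_score(q, d))
```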

In conclusion, embedding models face fundamental limits in representing all top-k document combinations due to dimensionality constraints. Future retrieval research should explore architectures beyond single-vector embeddings to handle complex, instruction-based queries effectively.
