Related question:
One year ago simonw said this in a post about embeddings:
[https://news.ycombinator.com/item?id=37985489]
> Lots of startups are launching new “vector databases”—which are effectively databases that are custom built to answer nearest-neighbour queries against vectors as quickly as possible.
> I’m not convinced you need an entirely new database for this: I’m more excited about adding custom indexes to existing databases. For example, SQLite has sqlite-vss and PostgreSQL has pgvector.
Do we still feel specialized vector databases are overkill?
We have AWS promoting Amazon OpenSearch as the default vector database for a RAG knowledge base, and that service is not cheap.
Also, I would like to understand a bit more about how to pre-process and chunk the data properly in a way that optimizes the vector embeddings, storage and retrieval ... any good guides on this that I can refer to? Thanks!
What’s the benefit of generating embeddings for such large chunks? Do people use these large contexts to include lots of document specific headers/footers or are they actually generating embeddings of single large documents?
I don’t understand how the math works out on those vectors
You don’t have to reduce a long context to a single embedding vector. Instead, you can compute the token embeddings of a long context and then pool those into say sentence embeddings.
The benefit is that each sentence’s embedding is informed by all of the other sentences in the context. So when a sentence refers to “The company” for example, the sentence embedding will have captured which company that is based on the other sentences in the context.
This technique is called ‘late chunking’ [1], and is based on another technique called ‘late interaction’ [2].
And you can combine late chunking (to pool token embeddings) with semantic chunking (to partition the document) for even better retrieval results. For an example implementation that applies both techniques, check out RAGLite [3].
[1] https://weaviate.io/blog/late-chunking
[2] https://jina.ai/news/what-is-colbert-and-late-interaction-an...
[3] https://github.com/superlinear-ai/raglite
I read both those articles, but I still don't get how to do it. It seems the idea is that more of the embedding is informed by context, but how do I _do_ late chunking?
My best guess so far is that somehow I embed a long text and then I break up the returned embedding into multiple parts and search each separately? But that doesn't sound right.
The name ‘late chunking’ is indeed somewhat of a misnomer in the sense that the technique does not partition documents into document chunks. What it actually does is to pool token embeddings (of a large context) into say sentence embeddings. The result is that your document is now represented as a sequence of sentence embeddings, each of which is informed by the other sentences in the document.
Then, you want to partition the document into chunks. Late chunking pairs really well with semantic chunking because semantic chunking can use late chunking's improved sentence embeddings to find semantically more cohesive chunks. In fact, you can cast this as a binary integer programming problem and find the ‘best’ chunks this way. See RAGLite [1] for an implementation of both techniques, including the formulation of semantic chunking as an optimization problem.
Finally, you have a sequence of document chunks, each represented as a multi-vector sequence of sentence embeddings. You could choose to pool these sentence embeddings into a single embedding vector per chunk. Or, you could leave the multi-vector chunk embeddings as-is and apply a more advanced querying technique like ColBERT's MaxSim [2].
[1] https://github.com/superlinear-ai/raglite
[2] https://huggingface.co/blog/fsommers/document-similarity-col...
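To make the MaxSim option above a bit more concrete, here is a minimal sketch in plain Python. It assumes the query is also represented as several vectors, and that each chunk is a list of sentence embeddings. The names `cosine` and `max_sim` and the toy 2-d vectors are made up for illustration; this is not ColBERT's or RAGLite's actual code.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def max_sim(query_vectors: Sequence[Vector], chunk_vectors: Sequence[Vector]) -> float:
    """For each query vector take its best match in the chunk, then sum those maxima."""
    return sum(max(cosine(q, d) for d in chunk_vectors) for q in query_vectors)


# Toy data (assumption): two query vectors and two candidate chunks.
query = [[1.0, 0.0], [0.0, 1.0]]
chunk_a = [[1.0, 0.0], [0.0, 1.0]]  # covers both query vectors
chunk_b = [[1.0, 0.0], [1.0, 0.0]]  # covers only the first query vector

score_a = max_sim(query, chunk_a)
score_b = max_sim(query, chunk_b)
print(score_a, score_b)  # 2.0 1.0
```

The point of keeping the multi-vector representation is visible here: `chunk_a` scores higher because each part of the query finds its own best match, which a single pooled vector per chunk could not express.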
You’d need to go a level below the API that most embedding services expose.
A transformer-based embedding model doesn’t just give you a vector for the entire input string; it gives you a vector for each token. These are then “pooled” together (e.g. averaged, max-pooled, or combined with some other strategy) to reduce the many vectors down to a single vector.
Late chunking means changing this reduction to yield many vectors instead of just one.
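As a rough sketch of what that changed reduction could look like: the snippet below assumes you already have per-token embeddings from one forward pass over the whole document, plus the token span of each sentence. The function names, the hard-coded 2-d vectors and the spans are illustrative stand-ins, not the output of any real model or library.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average a non-empty sequence of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def late_chunk(
    token_embeddings: Sequence[Vector],
    spans: Sequence[Tuple[int, int]],
) -> List[Vector]:
    """Pool token embeddings into one vector per (start, end) span, end exclusive."""
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]


# Stand-in (assumption) for the per-token output of a single pass over the
# full document, so every token vector has already "seen" the whole context.
token_embeddings = [
    [1.0, 0.0],
    [3.0, 2.0],
    [0.0, 4.0],
    [2.0, 2.0],
    [4.0, 0.0],
]
# Hypothetical sentence boundaries: tokens 0-1 form sentence 1, tokens 2-4 sentence 2.
sentence_spans = [(0, 2), (2, 5)]

# Usual pooling: one vector for the whole input.
document_embedding = mean_pool(token_embeddings)

# Late chunking: one vector per sentence, pooled from the same token vectors.
sentence_embeddings = late_chunk(token_embeddings, sentence_spans)
print(sentence_embeddings)  # [[2.0, 1.0], [2.0, 2.0]]
```

With a real model you would need access to the token-level hidden states and to the token offsets of each sentence, which is why this usually means running the model yourself rather than calling a hosted embeddings endpoint.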
You can achieve the same effect by using an LLM to do question answering prior to embedding. That is much more flexible but slower, and you can use CoT or even graph RAG. Late chunking is a faster, implicit alternative.
> What’s the benefit of generating embeddings for such large chunks?
Not an expert, but I believe now that we can fit more tokens into an LLM's context window, we can avoid a number of problems by providing additional context around any chunk of text that might be useful to the LLM. Solves the problem of misinterpretation of the important bit by the LLM.
I thought embedding large chunks would "dilute" the ideas, since large chunks tend to have multiple disparate ideas?
Does it somehow capture _all_ of the ideas, and querying for a single one would somehow match?
Isn't that the point of breaking down into sentences?
Someone mentioned adding context -- but doesn't it calculate embedding on the whole thing? The API Docs list `input` but no separate `context`. https://docs.voyageai.com/reference/embeddings-api
I've found good results from summarizing my documents using a large context model then embedding those summaries using a standard embedding model (e.g. e5)
This way I can tune what aspects of the doc I want to focus retrieval on, it's easier to determine when there are any data quality issues that need to be fixed, and the summaries have turned out to be useful for other use cases in the company.
Agreed. Esp. if you're gonna call an API anyway, you can call something cheaper than this embeddings model, like 4o-mini, to summarize, then use a small embeddings model fine-tuned for your needs locally.
I was critical about these guys before (not about their quality of work but rather about building a business around embeddings). This work though seems interesting and I might even give it a try, esp if they provide a fine-tuning API (is that on the roadmap?)
Not related, but why don’t they have a pricing page? Last time I checked voyageai I had to google their pricing to find the page, as it’s not in the nav menu.
What on earth is "OpenAI V3"? Just to be sure I wasn't being obtuse, I Googled it, only to get a bunch of articles pointing back at this post.
https://openai.com/index/new-embedding-models-and-api-update...
The API model names are text-embedding-3-small and text-embedding-3-large.
You missed the “large”, which adds context: “OpenAI V3 large” is their SOTA large embedding model.
It is OpenAI's vector embedding model
https://hn.algolia.com/?q=https%3A%2F%2Fblog.voyageai.com%2F...
I would like to see an independent benchmark.
This looks quite serious (which would be unsurprising given that Fei-Fei Li and Christopher Ré are involved).
I'm also quite interested in the nuts and bolts: does anyone know what the current accepted leaderboard on this is? I was screwing around with GritLM [1] a few months back and I seem to remember the MTEB [2] was kind of the headline thing at that time, but I might be out of date.
[1] https://arxiv.org/pdf/2402.09906
[2] https://huggingface.co/blog/mteb
I built a RAG system with Voyage and it crushed OpenAI embeddings; the difference in retrieval quality was noticeable.
What evaluation metrics did you use?