Reputation: 1472
If you want to generate embeddings for documents using Azure OpenAI with the ada-002 model, you can send at most 8192 tokens per request to this API. If a document has more than 8K tokens, then, based on my investigation, specific steps are needed to process it.
As an example, if a document (10K tokens) is split into two sub-documents (8K and 2K tokens), each sub-document embedding has 1536 dimensions, so the complete document ends up with 1536 x 2 = 3072 dimensions. A query (question), which does not exceed 8K tokens, has only 1536 dimensions and therefore cannot be compared with all documents.
So, is there a proper way to reduce those 3072-dimensional document embeddings back to 1536 dimensions?
According to my research this can be done using PCA. I have found the following example in C# (Accord.NET), but here the data is a double[][] instead of a double[]:
using Accord.Statistics.Analysis;

double[][] data = new double[][]
{
    // ... Your combined embedding vectors here
};

// Create a new Principal Component Analysis
var pca = new PrincipalComponentAnalysis()
{
    Method = PrincipalComponentMethod.Center,
    Whiten = false
};

// Learn the PCA model from the data
pca.Learn(data);

// Transform the data into the reduced-dimensionality space
double[][] reducedData = pca.Transform(data, 3); // Reducing to 3 dimensions
Any ideas?
Upvotes: 0
Views: 847
Reputation: 1472
Found the answer. There are multiple ways to approach this issue:
First, split the document into chunks. An important consideration is how the document is split: by sentences, by symbols, or by a fixed number of tokens. Using a fixed token count tuned to the model, for example 256, 512, or 1K tokens per chunk, works well with ADA-002. Then embed each chunk with the selected model (e.g. ADA-002) and gather all embedded chunks of the document. In most cases a token overlap between chunks increases the quality of the result.
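As a minimal sketch of such a chunker, assuming the document has already been tokenized into an array of token IDs (the tokenizer itself, e.g. cl100k_base for ADA-002, is not shown) and that chunk size and overlap are parameters you tune:

using System;
using System.Collections.Generic;

static class Chunker
{
    // Splits a tokenized document into fixed-size chunks with a configurable overlap.
    public static List<int[]> SplitIntoChunks(int[] tokens, int chunkSize, int overlap)
    {
        if (overlap >= chunkSize)
            throw new ArgumentException("Overlap must be smaller than the chunk size.");

        var chunks = new List<int[]>();
        int step = chunkSize - overlap;

        for (int start = 0; start < tokens.Length; start += step)
        {
            int length = Math.Min(chunkSize, tokens.Length - start);
            var chunk = new int[length];
            Array.Copy(tokens, start, chunk, 0, length);
            chunks.Add(chunk);

            if (start + length >= tokens.Length)
                break; // last chunk reached; avoids emitting a fully-overlapping tail
        }

        return chunks;
    }
}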
Dimensionality reduction can be implemented in multiple ways:
One good approach is, after all chunks of the document have been embedded, to take the average of each dimension across the chunks. Since an ADA-002 embedding has 1536 dimensions, we have multiple chunk vectors of 1536 dimensions each, and averaging them dimension by dimension gives a single vector of the same 1536 dimensions. This method is fast and easy to implement.
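A sketch of that averaging, assuming every chunk embedding is a double[] of the same length (1536 for ADA-002):

using System.Collections.Generic;

static class EmbeddingReducer
{
    // Element-wise average of the chunk embeddings; the result has the same
    // dimensionality as a single chunk embedding (1536 for ADA-002).
    public static double[] Average(IReadOnlyList<double[]> chunkEmbeddings)
    {
        int dims = chunkEmbeddings[0].Length;
        var result = new double[dims];

        foreach (var embedding in chunkEmbeddings)
            for (int i = 0; i < dims; i++)
                result[i] += embedding[i];

        for (int i = 0; i < dims; i++)
            result[i] /= chunkEmbeddings.Count;

        return result;
    }
}

Since ADA-002 embeddings are unit-length and usually compared with cosine similarity, you may also want to re-normalize the averaged vector to unit length.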
A second approach is, after all chunks of the document have been embedded, to concatenate the chunk embeddings. With 1536 dimensions per ADA-002 chunk, this gives 1536 x Number_of_Embeddings dimensions per document. We can then use PCA to reduce the dimensions back to the original 1536.
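A sketch of this second approach using the Accord.NET PCA from the question, assuming all documents are padded or truncated to the same number of chunks so the concatenated vectors have equal length, and using the NumberOfOutputs property to limit the number of components. Note that PCA is fitted across many documents, not on a single vector, so you need at least as many document vectors as target dimensions to actually obtain 1536 components:

using System.Collections.Generic;
using System.Linq;
using Accord.Statistics.Analysis;

static class PcaReducer
{
    // Concatenates the chunk embeddings of one document into a single long vector
    // (e.g. 2 chunks x 1536 dims = 3072 dims).
    public static double[] Concatenate(IEnumerable<double[]> chunkEmbeddings)
        => chunkEmbeddings.SelectMany(e => e).ToArray();

    // Fits PCA on the concatenated vectors of all documents and projects them
    // down to the target dimensionality (1536 here).
    public static double[][] ReduceWithPca(double[][] concatenatedDocuments, int targetDims = 1536)
    {
        var pca = new PrincipalComponentAnalysis()
        {
            Method = PrincipalComponentMethod.Center,
            Whiten = false,
            NumberOfOutputs = targetDims
        };

        pca.Learn(concatenatedDocuments);
        return pca.Transform(concatenatedDocuments);
    }
}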
I just tested splitting the document into fixed chunks of 1K tokens, with a token overlap of 100, and taking the average as the reduction method; it seems to work pretty well.
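Putting the pieces together with the helpers sketched above; documentTokens and EmbedChunk are placeholders for your tokenized document and for whatever client you use to call the ADA-002 endpoint:

// 1K-token chunks with a 100-token overlap, averaging as the reduction step.
List<int[]> chunks = Chunker.SplitIntoChunks(documentTokens, chunkSize: 1000, overlap: 100);

var chunkEmbeddings = new List<double[]>();
foreach (int[] chunk in chunks)
    chunkEmbeddings.Add(EmbedChunk(chunk)); // placeholder: returns a 1536-dim vector per chunk

double[] documentEmbedding = EmbeddingReducer.Average(chunkEmbeddings);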
Upvotes: 1