Reputation: 1472
If you want to generate embeddings for documents using Azure OpenAI with the ada-002 model, you can send at most 8192 tokens per request to this API. If a document has more than 8K tokens, then, based on my investigation, specific steps are needed to process it.
As an example, if a document (10K tokens) is split into two sub-documents (8K and 2K tokens), each sub-document embedding has 1536 dimensions, so the complete document ends up with 1536 x 2 = 3072 dimensions. A query (question), which does not exceed 8K tokens, has only 1536 dimensions and therefore cannot be compared with all documents.
So, is there a proper way to reduce those 3072-dimensional document embeddings back to 1536 dimensions?
According to my research this can be done using PCA. I have found the following example in C# (Accord.NET), but here the data is a double[][] instead of a double[]:
using Accord.Statistics.Analysis;

double[][] data = new double[][]
{
    // ... Your combined embedding vectors here
};

// Create a new Principal Component Analysis
var pca = new PrincipalComponentAnalysis()
{
    Method = PrincipalComponentMethod.Center,
    Whiten = false
};

// Learn the PCA model from the data
pca.Learn(data);

// Transform the data into the reduced-dimensionality space
double[][] reducedData = pca.Transform(data, 3); // Reducing to 3 dimensions
Any ideas?
Upvotes: 0
Views: 847
Reputation: 1472
Found the answer. There are multiple ways to approach this issue:
First, split the document into chunks. An important consideration is how the document is split: by sentences, by symbols, or by a fixed number of tokens. Using a fixed token count tuned to the model, for example 256, 512, or 1K tokens per chunk, works well with ADA-002. Then embed each chunk with the selected model (e.g. ADA-002) and gather all embedded chunks of the document. In most cases a token overlap between chunks increases the quality of the result.
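As a minimal sketch of such a chunker, assuming the document has already been tokenized into an array of token IDs (the tokenizer itself, e.g. cl100k_base for ADA-002, is not shown) and that chunk size and overlap are parameters you tune:

using System;
using System.Collections.Generic;

static class Chunker
{
    // Splits a tokenized document into fixed-size chunks with a configurable overlap.
    public static List<int[]> SplitIntoChunks(int[] tokens, int chunkSize, int overlap)
    {
        if (overlap >= chunkSize)
            throw new ArgumentException("Overlap must be smaller than the chunk size.");

        var chunks = new List<int[]>();
        int step = chunkSize - overlap;

        for (int start = 0; start < tokens.Length; start += step)
        {
            int length = Math.Min(chunkSize, tokens.Length - start);
            var chunk = new int[length];
            Array.Copy(tokens, start, chunk, 0, length);
            chunks.Add(chunk);

            if (start + length >= tokens.Length)
                break; // last chunk reached; avoids emitting a fully-overlapping tail
        }

        return chunks;
    }
}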
Dimensionality reduction can be implemented in multiple ways:
One good approach is, after all chunks of the document have been embedded, to take the average of each dimension across the chunks. Since an ADA-002 embedding has 1536 dimensions, we have multiple chunk vectors of 1536 dimensions each, and averaging them dimension by dimension gives a single vector of the same 1536 dimensions. This method is fast and easy to implement.
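A sketch of that averaging, assuming every chunk embedding is a double[] of the same length (1536 for ADA-002):

using System.Collections.Generic;

static class EmbeddingReducer
{
    // Element-wise average of the chunk embeddings; the result has the same
    // dimensionality as a single chunk embedding (1536 for ADA-002).
    public static double[] Average(IReadOnlyList<double[]> chunkEmbeddings)
    {
        int dims = chunkEmbeddings[0].Length;
        var result = new double[dims];

        foreach (var embedding in chunkEmbeddings)
            for (int i = 0; i < dims; i++)
                result[i] += embedding[i];

        for (int i = 0; i < dims; i++)
            result[i] /= chunkEmbeddings.Count;

        return result;
    }
}

Since ADA-002 embeddings are unit-length and usually compared with cosine similarity, you may also want to re-normalize the averaged vector to unit length.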
A second approach is, after all chunks of the document have been embedded, to concatenate the chunk embeddings. With 1536 dimensions per ADA-002 chunk, this gives 1536 x Number_of_Embeddings dimensions per document. We can then use PCA to reduce the dimensions back to the original 1536.
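A sketch of this second approach using the Accord.NET PCA from the question, assuming all documents are padded or truncated to the same number of chunks so the concatenated vectors have equal length, and using the NumberOfOutputs property to limit the number of components. Note that PCA is fitted across many documents, not on a single vector, so you need at least as many document vectors as target dimensions to actually obtain 1536 components:

using System.Collections.Generic;
using System.Linq;
using Accord.Statistics.Analysis;

static class PcaReducer
{
    // Concatenates the chunk embeddings of one document into a single long vector
    // (e.g. 2 chunks x 1536 dims = 3072 dims).
    public static double[] Concatenate(IEnumerable<double[]> chunkEmbeddings)
        => chunkEmbeddings.SelectMany(e => e).ToArray();

    // Fits PCA on the concatenated vectors of all documents and projects them
    // down to the target dimensionality (1536 here).
    public static double[][] ReduceWithPca(double[][] concatenatedDocuments, int targetDims = 1536)
    {
        var pca = new PrincipalComponentAnalysis()
        {
            Method = PrincipalComponentMethod.Center,
            Whiten = false,
            NumberOfOutputs = targetDims
        };

        pca.Learn(concatenatedDocuments);
        return pca.Transform(concatenatedDocuments);
    }
}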
I just tested splitting the document into fixed chunks of 1K tokens, with a token overlap of 100, and taking the average as the reduction method; it seems to work pretty well.
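Putting the pieces together with the helpers sketched above; documentTokens and EmbedChunk are placeholders for your tokenized document and for whatever client you use to call the ADA-002 endpoint:

// 1K-token chunks with a 100-token overlap, averaging as the reduction step.
List<int[]> chunks = Chunker.SplitIntoChunks(documentTokens, chunkSize: 1000, overlap: 100);

var chunkEmbeddings = new List<double[]>();
foreach (int[] chunk in chunks)
    chunkEmbeddings.Add(EmbedChunk(chunk)); // placeholder: returns a 1536-dim vector per chunk

double[] documentEmbedding = EmbeddingReducer.Average(chunkEmbeddings);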
Upvotes: 1