esteebie

Reputation: 167

Azure search index with RAG vector search on own data. Chat completions doesn't return chunk_id as expected

When I index with a skillset that splits the text and then creates embeddings, I am able to search and retrieve a chat completion, along with relevant citations, in a single API call to https://xxxxxxxxxxxxxx.openai.azure.com/openai/deployments/yyyyyyyyyyy/chat/completions?api-version=2024-03-01-preview.

The problem is that, unlike with a keyword search on the original text documents, the chunk_id field that comes back is always 0, so I am unable to process the citations meaningfully for the user.

I think this is because the split skill results in a new record in the index for every chunk and the vector search returns the entire chunk rather than a portion of it.

There must be a way around this. If, for example, you could specify the fields returned along with each citation, it would be easy to derive a page number for each reference, since Azure automatically creates the document ID as parentDocumentId_pages01 etc.
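
For illustration, pulling a page number out of such a key would be trivial; here is a rough Python sketch (assuming the chunk key keeps the parentDocumentId_pagesNN pattern; the helper name is just illustrative):

import re

# Hypothetical helper: pull the page/chunk ordinal out of a split-skill
# document ID of the form parentDocumentId_pages01.
def page_from_chunk_key(chunk_key):
    match = re.search(r"_pages(\d+)$", chunk_key)
    return int(match.group(1)) if match else None

print(page_from_chunk_key("aGVsbG8_pages01"))  # -> 1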

Any help much appreciated and thanks for reading.

Case 1: Keyword search on non-chunked document in index:

"message": {
    "role": "assistant",
    "content": "zzzzz [doc2].",
    "end_turn": true,
    "context": {
        "citations": [
            {
                "content": "xxxxxxxxxx",
                "title": "Process Hierarchy - 22-11-2018 11-01.pdf",
                "url":"https://xxxxxxxxx.sharepoint.com/xxxxxxxxx/Process%20Hierarchy%20-%2022-11-2018%2011-01.pdf",
                "filepath": "/xxxxxxxxx/Process Hierarchy - 22-11-2018 11-01.pdf",
                "chunk_id": "31"}]}}

Case 2: Vector search on index containing chunks of a document split by the skillset:

{
    "id": "zzzzzzzzzzzzzz",
    "model": "gpt-35-turbo",
    "created": 1718180381,
    "object": "extensions.chat.completion",
    "choices": [
        {
            "index": 0,
            "finish_reason": "stop",
            "message": {
                "role": "assistant",
                "content": "yyyyyyyyyyyyyyyyyyy",
                "end_turn": true,
                "context": {
                    "citations": [
                        {
                            "content": "xxxxxxxxxxxxxxx",
                            "title": "doc1 Comms 2022.pdf",
                            "url": "https://xxxxxxxxxx.blob.core.windows.net/xxxx/doc1%20%20Comms%202022.pdf",
                            "filepath": null,
                            "chunk_id": "0"
                        },
                        {
                            "content": "yyyyyyyyyyyyyyyy",
                            "title": "doc1 Comms 2022.pdf",
                            "url": "https://xxxxxxxxxx.blob.core.windows.net/xxxx/doc1%20%20Comms%202022.pdf",
                            "filepath": null,
                            "chunk_id": "0"}]}}

You can see that in the second example the same file is returned twice, since two different chunks were identified. Both chunks are labelled '0', however.

The desired outcome would be either for chunk_id to differ between the two citations, or to somehow concatenate the array index from the split skill's output with the document title, so that the titles become 'doc1 Comms 2022.pdf - pt 1' and 'doc1 Comms 2022.pdf - pt 2'.
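
As a client-side workaround sketch in Python (post-processing the citations array from the response; the function name is illustrative):

from collections import defaultdict

# Sketch: post-process the citations array from the response so repeated
# titles become '<title> - pt 1', '<title> - pt 2', ...
def relabel_citations(citations):
    counts = defaultdict(int)
    for citation in citations:
        title = citation["title"]
        counts[title] += 1
        citation["title"] = "{} - pt {}".format(title, counts[title])
    return citations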

The only option seems to be a custom web skill via an Azure Function, but that seems far too over-engineered for something so simple.
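
For completeness, a custom skill would not be a huge amount of code; below is a rough Python sketch of the web API contract Azure AI Search expects from a custom skill, where the input fields 'ordinal' and 'title' are assumptions about what the skillset would supply, not a tested implementation:

import json
import azure.functions as func

# Custom-skill sketch: builds a per-chunk label from an assumed ordinal
# input, following the custom web API skill contract
# ({"values": [{"recordId", "data"}]} in and out).
def main(req: func.HttpRequest) -> func.HttpResponse:
    results = []
    for record in req.get_json()["values"]:
        data = record["data"]
        label = "{} - pt {}".format(data.get("title", ""), data.get("ordinal", 0) + 1)
        results.append({
            "recordId": record["recordId"],
            "data": {"chunk_label": label},
            "errors": None,
            "warnings": None,
        })
    return func.HttpResponse(json.dumps({"values": results}),
                             mimetype="application/json")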

Here's the API call to https://xxxxxx/openai/deployments/yyyyy/chat/completions?api-version=2024-03-01-preview

A select statement on the index does not work either, as the completions API always sends back the same fields, unless this can be configured somehow.

{
    "data_sources": [
        {
            "type": "azure_search",
            "parameters": {
                "endpoint": "https://xxxxx.search.windows.net",
                "authentication":{"type":"api_key","key": "yyyyyyy"},
                "index_name": "xxxxxx",
                "topNDocuments":5,
                "query_type": "vectorSimpleHybrid",
                "vectorFilterMode": "preFilter",
                "filter":"(search.ismatch('@{variables('region')}', 'region')) and (search.ismatch('@{variables('channel')}', 'channel'))",
                "embeddingEndpoint":"xxxxx/openai/deployments/yyyyyyy/embeddings?api-version=2024-02-15-preview",
                "embeddingKey":"xxxxxxx"
            }
        }
    ],
    "messages": [
        {
            "role": "system",
            "content": "You are..."
        },
        {
            "role": "user",
            "content": "@{triggerBody()?['text']}"
        }
    ],
    "temperature": 1.2,
    "top_p": 0.5,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "max_tokens": 2000,
    "stop": null
}
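
(For anyone reproducing this outside the Logic App, the equivalent call from Python is sketched below; every endpoint, key, and value is a placeholder mirroring the redacted payload above.)

import requests

url = ("https://xxxxxx.openai.azure.com/openai/deployments/yyyyy/"
       "chat/completions?api-version=2024-03-01-preview")
payload = {
    "data_sources": [{
        "type": "azure_search",
        "parameters": {
            "endpoint": "https://xxxxx.search.windows.net",
            "authentication": {"type": "api_key", "key": "yyyyyyy"},
            "index_name": "xxxxxx",
            "query_type": "vectorSimpleHybrid",
        },
    }],
    "messages": [{"role": "user", "content": "test question"}],
    "max_tokens": 2000,
}
response = requests.post(url, headers={"api-key": "zzzzz"}, json=payload)
for citation in response.json()["choices"][0]["message"]["context"]["citations"]:
    print(citation["title"], citation["chunk_id"])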

Upvotes: 0

Views: 672

Answers (1)

JayashankarGS

Reputation: 8040

When you run a query of type vector_simple_hybrid, the parameter embedding_dependency is required; check this documentation.

You pass either a DeploymentNameVectorizationSource or an EndpointVectorizationSource.
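
For example, the deployment-name form is just (the deployment name below is a placeholder):

"embedding_dependency": {
    "type": "deployment_name",
    "deployment_name": "YOUR_EMBEDDING_DEPLOYMENT_NAME"
}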

Next, you said the chunk_id is not coming through in the results; note that your filepath is also coming back as null.

To get proper results, add fields_mapping to the data source parameters:

"data_sources": [
        {
            "type": "azure_search",
            "parameters": {
                "endpoint": "https://xxxxx.search.windows.net",
                "authentication":{"type":"api_key","key": "yyyyyyy"},
                "index_name": "xxxxxx",
                "topNDocuments":5,
                "query_type": "vectorSimpleHybrid",
                "vectorFilterMode": "preFilter",
                "embedding_dependency":{
                        "endpoint":"https://{YOUR_RESOURCE_NAME}.openai.azure.com/openai/deployments/YOUR_DEPLOYMENT_NAME/embeddings",
                        "type":"endpoint",
                        "authentication":{
                            "type":"api_key",
                            "key":"your_key"
                        },
               "dimensions":"give_dimensions_like_1536_for_text-embedding-ada-002"
                },
                "filter":"(search.ismatch('@{variables('region')}', 'region')) and (search.ismatch('@{variables('channel')}', 'channel'))",
                "fields_mapping":{
                    "content_fields":"Your_content_field_in_index",
                    "filepath_field":"Your_filepath_or_chunkid_field_in_index",
                    "title_field":"Your_title_field_in_index"
                    
                }
            }
        }
        ]

This configuration gives you the proper results. Note that dimensions is a number and must match your embedding model (1536 for text-embedding-ada-002).

Upvotes: 0
