John Tan

Reputation: 1385

Azure OpenAI citations with message correlation

I am trying out Azure OpenAI with my own data. The data is uploaded to Azure Blob Storage and indexed for use with Azure AI Search.

I call the endpoint with POST {endpoint}/openai/deployments/{deployment-id}/chat/completions?api-version={api-version}, as referenced here.

However, I cannot figure out how the choices[0]['message']['context']['citations'] field in the response corresponds to choices[0]['message']['content'].

For example, the content can look something like this:

I have a pear [doc1][doc2]. I have an apple [doc1][doc3].

However, my citations look like this:

citations[0].filepath == 'file1.pdf'
citations[1].filepath == 'file2.pdf'
citations[2].filepath == 'file1.pdf'
citations[3].filepath == 'file3.pdf'
citations[4].filepath == 'file4.pdf'
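
To make the shape concrete, here is a minimal sketch of the two fields in question, using a hand-built dict in place of a real response. The `response` value and its contents are assumptions that mirror the example above, not actual service output.

```python
import re

# Hypothetical response body, trimmed to the two fields discussed above.
response = {
    "choices": [{
        "message": {
            "content": "I have a pear [doc1][doc2]. I have an apple [doc1][doc3].",
            "context": {
                "citations": [
                    {"filepath": "file1.pdf"},
                    {"filepath": "file2.pdf"},
                    {"filepath": "file1.pdf"},
                    {"filepath": "file3.pdf"},
                    {"filepath": "file4.pdf"},
                ]
            },
        }
    }]
}

message = response["choices"][0]["message"]

# Markers in the order they appear in the answer text.
markers = re.findall(r"\[doc\d+\]", message["content"])

# File paths in the order they appear in the citations list.
filepaths = [c["filepath"] for c in message["context"]["citations"]]

print(markers)
print(filepaths)
```

The content holds four markers (three distinct ones) while the citations list has five entries, which is the mismatch I am asking about.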

In summary, my question is whether there is some sort of mapping from the doc markers shown in the message to citations.filepath.

Upvotes: 0

Views: 1125

Answers (1)

JayashankarGS

Reputation: 8040

Actually, it is not about the length of the citations list; it is about how many times each file is referred to.

If you look closely, you can see that 'file1.pdf' is referred to twice, so the mapping follows the order in which the docs appear in the content, including reuse of a doc, as below:

  • doc1 -> citations[0] (file1.pdf).
  • doc2 -> citations[1] (file2.pdf).
  • Reuse of doc1 -> Refers back to the first document (citations[2], file1.pdf).
  • doc3 -> citations[3] (file3.pdf).

Use the code below to build the mapping and substitute the file paths into the content.

import re

def map_citations(content, citations):
    # Match placeholders such as [doc1] and capture the number.
    pattern = re.compile(r'\[doc(\d+)\]')

    # Placeholder numbers in order of appearance, e.g. [1, 2, 1, 3].
    # findall avoids mistaking a purely numeric text segment for a marker.
    doc_numbers = [int(num) for num in pattern.findall(content)]

    # Pair the i-th placeholder occurrence with citations[i]. zip stops at
    # the shorter sequence, so leftover citations are simply ignored.
    doc_to_file_map = {}
    for doc_num, citation in zip(doc_numbers, citations):
        doc_to_file_map[f'doc{doc_num}'] = citation['filepath']

    print(doc_to_file_map)

    def replace_placeholder(match):
        key = f'doc{match.group(1)}'
        # Leave the placeholder untouched if no citation was paired with it.
        if key not in doc_to_file_map:
            return match.group(0)
        return f'[{doc_to_file_map[key]}]'

    return pattern.sub(replace_placeholder, content)

content = "I have a pear [doc1][doc2]. I have an apple [doc1][doc3]."
citations = [
    {'filepath': 'file1.pdf'},
    {'filepath': 'file2.pdf'},
    {'filepath': 'file1.pdf'},
    {'filepath': 'file3.pdf'},
    {'filepath': 'file4.pdf'}
]

mapped_content = map_citations(content, citations)
print(mapped_content)

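Output:

{'doc1': 'file1.pdf', 'doc2': 'file2.pdf', 'doc3': 'file3.pdf'}
I have a pear [file1.pdf][file2.pdf]. I have an apple [file1.pdf][file3.pdf].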
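
A different reading is also worth checking against your own responses: that the number in `[docN]` is simply a 1-based index into `citations`, so `[doc3]` points at `citations[2]` no matter how often a marker is repeated. This is an assumption, not something confirmed in this thread; under it, two entries with the same `filepath` would be different chunks of the same file. A minimal sketch, with a hypothetical helper name `resolve_by_index`:

```python
import re

DOC_MARKER = re.compile(r"\[doc(\d+)\]")

def resolve_by_index(content, citations):
    """Replace each [docN] with the filepath of citations[N-1] (assumed 1-based)."""
    def replace(match):
        index = int(match.group(1)) - 1
        if 0 <= index < len(citations):
            return f"[{citations[index]['filepath']}]"
        # Leave markers that point outside the list untouched.
        return match.group(0)
    return DOC_MARKER.sub(replace, content)

content = "I have a pear [doc1][doc2]. I have an apple [doc1][doc3]."
citations = [
    {"filepath": "file1.pdf"},
    {"filepath": "file2.pdf"},
    {"filepath": "file1.pdf"},
    {"filepath": "file3.pdf"},
    {"filepath": "file4.pdf"},
]

resolved = resolve_by_index(content, citations)
print(resolved)
```

With the sample data the two approaches differ only for `[doc3]`: this version resolves it to `citations[2]` (file1.pdf) instead of `citations[3]` (file3.pdf). Comparing a cited sentence against the source files should show which reading matches your deployment.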

Upvotes: 1
