Reputation: 1385
I am trying out Azure OpenAI with my own data. The data is uploaded to Azure Blob Storage and indexed for use with Azure AI Search.
I call the endpoint in the form POST {endpoint}/openai/deployments/{deployment-id}/chat/completions?api-version={api-version}, as referenced here.
However, I cannot figure out how the choices[0]['message']['context']['citations'] field in the response corresponds to choices[0]['message']['content'].
For example, content can be something like:
I have a pear [doc1][doc2]. I have an apple [doc1][doc3].
However, citations looks like this:
citations[0].filepath == 'file1.pdf'
citations[1].filepath == 'file2.pdf'
citations[2].filepath == 'file1.pdf'
citations[3].filepath == 'file3.pdf'
citations[4].filepath == 'file4.pdf'
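For reference, here is a minimal sketch of pulling those two fields out of a parsed response. The `response` dict is a hypothetical, trimmed-down stand-in that reuses the example values from this question, and `extract_content_and_citations` is an illustrative helper name, not part of any SDK.

```python
# Hypothetical response, trimmed to the two fields discussed in the question.
response = {
    "choices": [
        {
            "message": {
                "content": "I have a pear [doc1][doc2]. I have an apple [doc1][doc3].",
                "context": {
                    "citations": [
                        {"filepath": "file1.pdf"},
                        {"filepath": "file2.pdf"},
                        {"filepath": "file1.pdf"},
                        {"filepath": "file3.pdf"},
                        {"filepath": "file4.pdf"},
                    ]
                },
            }
        }
    ]
}


def extract_content_and_citations(response):
    """Return (content, citations) from the first choice of a parsed response."""
    message = response["choices"][0]["message"]
    content = message["content"]
    # Fall back to an empty list if the message carries no context or citations.
    citations = message.get("context", {}).get("citations", [])
    return content, citations


content, citations = extract_content_and_citations(response)
print(content)
print([c["filepath"] for c in citations])
```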
In summary, my question is whether there is some sort of mapping from the doc placeholders shown in the message to citations.filepath.
Upvotes: 0
Views: 1125
Reputation: 8040
Actually, it is not about the length of citations; it is about how many times a file is referenced.
If you look closely, you can see that 'file1.pdf' is referenced twice, so the mapping follows the order in which the docs first appear and are then reused, like this:
doc1 -> citations[0] (file1.pdf).
doc2 -> citations[1] (file2.pdf).
doc1 -> refers back to the first document (citations[2], file1.pdf).
doc3 -> citations[3] (file3.pdf).
Use the code below to get the mapping and apply it to the content.
import re

def map_citations(content, citations):
    # Match placeholders such as [doc1] and capture the number.
    pattern = re.compile(r'\[doc(\d+)\]')

    # Placeholder numbers in order of appearance, e.g. [1, 2, 1, 3].
    doc_numbers = [int(n) for n in pattern.findall(content)]

    # Pair the i-th placeholder with the i-th citation.
    doc_to_file_map = {}
    for i, doc_num in enumerate(doc_numbers):
        doc_to_file_map[f'doc{doc_num}'] = citations[i]['filepath']
    print(doc_to_file_map)

    def replace_placeholder(match):
        doc_num = match.group(1)
        return f"[{doc_to_file_map[f'doc{doc_num}']}]"

    # Swap each [docN] placeholder for its file path.
    mapped_content = pattern.sub(replace_placeholder, content)
    return mapped_content

content = "I have a pear [doc1][doc2]. I have an apple [doc1][doc3]."
citations = [
    {'filepath': 'file1.pdf'},
    {'filepath': 'file2.pdf'},
    {'filepath': 'file1.pdf'},
    {'filepath': 'file3.pdf'},
    {'filepath': 'file4.pdf'}
]

mapped_content = map_citations(content, citations)
print(mapped_content)
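Output:
{'doc1': 'file1.pdf', 'doc2': 'file2.pdf', 'doc3': 'file3.pdf'}
I have a pear [file1.pdf][file2.pdf]. I have an apple [file1.pdf][file3.pdf].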
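Since this pairing assumes that the i-th placeholder in content lines up with the i-th citation, it may be worth checking that assumption against your own responses. Below is a rough sketch of such a check: it builds the same mapping but also records any doc label that ends up paired with two different file paths, which would suggest the assumption does not hold for that response. `pair_placeholders` is a hypothetical helper name, and the data is the example from the question.

```python
import re

DOC_PATTERN = re.compile(r"\[doc(\d+)\]")


def pair_placeholders(content, citations):
    """Pair [docN] placeholders with citations by order of appearance.

    Returns (mapping, conflicts). A conflict is a tuple
    (label, first_filepath, later_filepath) for a label seen with two paths.
    """
    labels = [f"doc{n}" for n in DOC_PATTERN.findall(content)]
    mapping = {}
    conflicts = []
    # zip stops at the shorter sequence, so extra citations are ignored.
    for label, citation in zip(labels, citations):
        filepath = citation["filepath"]
        previous = mapping.setdefault(label, filepath)
        if previous != filepath:
            conflicts.append((label, previous, filepath))
    return mapping, conflicts


content = "I have a pear [doc1][doc2]. I have an apple [doc1][doc3]."
citations = [
    {"filepath": "file1.pdf"},
    {"filepath": "file2.pdf"},
    {"filepath": "file1.pdf"},
    {"filepath": "file3.pdf"},
    {"filepath": "file4.pdf"},
]

mapping, conflicts = pair_placeholders(content, citations)
print(mapping)
print(conflicts)
```

For the example data, both occurrences of doc1 land on file1.pdf, so the conflict list comes back empty. A non-empty list on real data would be a sign to look at the response more closely before relying on this mapping.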
Upvotes: 1