elbillaf

Reputation: 1984

Retrieving "source documents" on a RAG setup with langchain / llama

I have a set of PDF documents (over 1000) which I've converted to text files. Let's call them "doc0001.txt", "doc0002.txt", etc. I've built a RAG setup to query these documents.

Say doc0001.txt has references that list docA, docB, docC, etc.

I have code that queries against this text corpus like this:

prompt = "Tell me about artificial intelligence in medicine"
output = qa_llm({'query': prompt})

print(output["result"], '|'.join([i.page_content for i in output['source_documents']]))

It works! Kinda. But it doesn't give me what I wanted or expected. The answer it gives lists the sources that are cited INSIDE the documents "doc0001.txt", "doc0002.txt", etc.

That is, it lists docA, docB, etc.

That's useful, but in this case what I need to know is which of the source documents that I provided contain the information - not the references listed inside those documents. That is, the answer I want (in this case) is doc0357.txt, doc0784.txt, etc.

Is there a command to get THAT information?

Upvotes: 0

Views: 2400

Answers (1)

lif cc

Reputation: 471

The question is a little messy...

I'm not sure I understand what you mean.

Maybe you need this?

# Import path for recent LangChain releases; older ones use langchain.document_loaders
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("./django-design-patterns-best-practices-2nd.pdf")

# load() returns a list of Document objects
data = loader.load()

data[0]

# Document(page_content='',
#   metadata={'source': './django-design-patterns-best-practices-2nd.pdf',
#     'file_path': './django-design-patterns-best-practices-2nd.pdf',
#     'page': 0, 'total_pages': 274, 'format': 'PDF 1.4',
#     'title': 'Django Design Patterns and Best Practices',
#     'author': 'Arun Ravindran', 'subject': '', 'keywords': '',
#     'creator': 'Adobe Acrobat 9.5.5',
#     'producer': 'Acrobat Distiller 9.5.5 (Windows)',
#     'creationDate': "D:20180714073241-04'00'",
#     'modDate': "D:20180714073552-04'00'", 'trapped': ''})

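Each Document's metadata["source"] holds the file the content was loaded from, e.g. data[0].metadata["source"]. In your case output is a dict, so read it from each retrieved document instead: i.metadata["source"] for every i in output['source_documents'] (provided the loader that built your index recorded a "source" entry, as PyMuPDFLoader does above).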
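Applied to the question's output, that could look roughly like the sketch below. It assumes each retrieved chunk carries a "source" entry in its metadata and that the chain was built to return source documents (which the question's code already relies on). The helper name source_files and the fake_output data are made up for illustration; SimpleNamespace only stands in for LangChain's Document so the snippet runs without a model or vector store.

```python
import os
from types import SimpleNamespace


def source_files(output):
    """Return the unique file names behind the retrieved chunks, in retrieval order."""
    names = []
    for doc in output.get("source_documents", []):
        # Each retrieved chunk keeps the metadata its loader attached.
        source = doc.metadata.get("source")
        if source is None:
            continue  # chunk was indexed without a "source" entry
        name = os.path.basename(source)
        if name not in names:
            names.append(name)
    return names


# Hypothetical stand-in for the dict returned by qa_llm({'query': prompt}).
fake_output = {
    "result": "...",
    "source_documents": [
        SimpleNamespace(page_content="chunk 1", metadata={"source": "./corpus/doc0357.txt"}),
        SimpleNamespace(page_content="chunk 2", metadata={"source": "./corpus/doc0784.txt"}),
        SimpleNamespace(page_content="chunk 3", metadata={"source": "./corpus/doc0357.txt"}),
    ],
}

print(source_files(fake_output))
```

With the real chain you would call source_files(output) on the dict from the question. If the list comes back empty, the chunks were probably indexed without a "source" entry, and the metadata would need to be set when the text files are loaded.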

Upvotes: 1
