priyanka
priyanka

Reputation: 40

Q&A using Retrieval-Augmented Generation with Langchain

I have been doing a POC to implement RAG driven model for my AI/ML use case.
The use case is to "Find Similar and duplicate controls by comparing each ID with every other ID, Generate similarity scores and list the pairs which exceeds a threshold of 80-87 for similar controls and exceeding above 95 for duplicate controls"

The code snippet is :

loader = CSVLoader(file_path="control.csv")
data = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(data)
vectorstore = Chroma.from_documents(documents=chunks, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()
template = """You are an assistant for question-answering tasks.

Use the following pieces of retrieved context to answer the question.

If you don't know the answer, just say that you don't know.

Use three sentences maximum and keep the answer concise.

Question: {question}

Context: {context}

Answer:

"""

prompt = ChatPromptTemplate.from_template(template)
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo",verbose=True)

rag_chain = ( {"context": retriever, "question": RunnablePassthrough()} | prompt | llm | StrOutputParser() )

query = "FInd Similar controls by comparing each ID with every other ID in the document, combining their Name and Description. Calculate similarity scores between them and list all the pairs that is exceeding a threshold of 80-87for similar controls and above 95 for duplicate controls."

rag_chain.invoke(query)

The output i got was :
1. There are a total of 6 controls formed by comparing each ID with every other ID in the document. The similarity scores between them can be calculated and pairs exceeding a threshold of 80 can be listed in the output.
2. I don't Know

My expected outcome is to print the list of Similar and Duplicate pairs from the data , it has around 3500+ data.

But i dont find to see the expected output here ? Iam not sure where am wrong. Also would like to know if i have mentioned the right prompt for the scenario.

Also, I have tried the same prompt where i have not implemented RAG , but i could proper results , it just a connection made with Langchain and OpenAI for interaction.

I would like to know where am wrong and what needs to be corrected in order to get the right expected outcome.

Upvotes: 0

Views: 493

Answers (1)

Eric Vaillancourt
Eric Vaillancourt

Reputation: 81

When you say:

Blockquote My expected outcome is to print the list of Similar and Duplicate pairs from the data , it has around 3500+ data.

First, in your prompt you need to be explicit on how you want the output to be.

Like:

Blockquote Output the result in CSV format and only list Similar and Duplicate pairs.

Second, you could try to use a different output parse like Pydantic or the Structured Output parser

Be careful with the Pydantic parser because it’s sensitive to version changes with LangChain.

Third, you should implement a system prompt to give precise instructions to the LLM. You need to do this because in LangChain if you don’t supply a system prompt, LangChain will provide a basic one that could conflict with your question.

Hope this helps!

Upvotes: 0

Related Questions