Reputation: 55
I am building a prototype that fetches the relevant documents for an input question (the search should use both keywords and context). For this, I have data frames of vector embeddings (all-mpnet-base-v2) of different documents, which are stored in PGVector. I am using an ensemble retriever with BM25 as the keyword-based retriever and a PGVector search query as the context-based content retriever. The code is attached below:
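from urllib.parse import quote_plus

import pandas as pd
# Import paths as in older LangChain releases; newer ones use langchain_community.
from langchain.document_loaders import DataFrameLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.vectorstores.pgvector import PGVector

# `config` is assumed to be a configparser.ConfigParser loaded elsewhere.


def hybrid_search(question):
    embeddings = HuggingFaceEmbeddings(model_name="all-mpnet-base-v2")

    # Keyword retriever: BM25 over the raw document text.
    df_ = pd.read_csv("contents.csv", usecols=["enhancedContent"])
    loader_ = DataFrameLoader(df_, page_content_column="enhancedContent")
    docs = loader_.load()  # not used below
    pages = loader_.load_and_split()
    bm25_retriever = BM25Retriever.from_documents(pages)
    bm25_retriever.k = 2

    # Context retriever: similarity search over the embeddings in PGVector.
    collection_name = "dummy_db"
    # The stored connection string has a %s placeholder for the URL-quoted password.
    CONNECTION_STRING = config.get("pg_vector_details", "CONNECTION_STRING") % quote_plus(
        config.get("pg_vector_details", "password")
    )
    store1 = PGVector(
        collection_name=collection_name,
        connection_string=CONNECTION_STRING,
        embedding_function=embeddings,
    )
    retriever_pgvector = store1.as_retriever(search_kwargs={"k": 3})

    # Combine both retrievers, weighting BM25 at 0.4 and PGVector at 0.6.
    ensemble_retriever = EnsembleRetriever(
        retrievers=[bm25_retriever, retriever_pgvector], weights=[0.4, 0.6]
    )
    context = ensemble_retriever.get_relevant_documents(question)
    print("Context from DB- Ensemble retriever: ", context)
    return context, ensemble_retriever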
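For reference, here is a minimal sketch of how the weights [0.4, 0.6] could influence the final ordering. It assumes the ensemble fuses results with weighted Reciprocal Rank Fusion and a constant of 60, which is my understanding of EnsembleRetriever but should be checked against the installed version. The function name `weighted_rrf` and the document ids are hypothetical.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists of document ids with weighted Reciprocal Rank Fusion.

    Assumption: each document scores weight / (rank + c) per list it appears in,
    with ranks starting at 1, and the per-list scores are summed.
    """
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results: BM25 returns k=2 ids, the vector search returns k=3 ids.
bm25_hits = ["doc_a", "doc_b"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = weighted_rrf([bm25_hits, vector_hits], weights=[0.4, 0.6])
print(fused)
```

Under this scheme a document that only BM25 finds, even at rank 1, can end up below documents that only the vector search finds, because rank differences are small compared with the constant 60 and the weights dominate the score.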
The hybrid_search function above does not return the expected content for the question "What is meant by expect_column_values_to_be_between?", even though a matching document is present in the DB. The content in the DB that is relevant to the question is given below:
The content contains information about 'expect_column_values_to_be_between', which is a sub-title coming under a hierarchy of titles as ['Rule library']. The actual content starts from here: Description: Validates that entries in a specified column fall within a defined inclusive range, ensuring data adheres to expected bounds.
Dimension: Accuracy
Rule Level: Column
Mandatory Argument(s):
1.Column name (Supported data types: Numeric )
2.Enter lower bound (Supported data types: Numeric )
3.Enter upper bound (Supported data types: Numeric )
Optional Argument(s):
1.Value to be greater than lower bound ( When switched ON, rule succeeds only when value is strictly greater than specified lower bound value; when OFF, rule succeeds even when the value is greater than or equal to the specified lower bound value. By default, it is switched OFF )
2.Value to be lesser than upper bound ( When switched ON, rule succeeds only when value is strictly less than specified upper bound value; when OFF, rule succeeds even when the value is lower than or equal to the specified upper bound value. By default, it is switched OFF )
3.Tolerance level (%) ( Percentage of records that is expected to meet the required criteria, below which the rule fails. By default it is set to 100%, meaning all records are expected to meet specified criteria )
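One thing that may be worth checking against the content above is whether the keyword side can match the rule name at all. The sketch below assumes the BM25 retriever tokenizes by splitting on whitespace (I believe this is the default preprocessing in LangChain's BM25Retriever, but it should be verified). The helper names `whitespace_tokens` and `word_tokens` are hypothetical.

```python
import re

query = "What is meant by expect_column_values_to_be_between?"
content = (
    "The content contains information about "
    "'expect_column_values_to_be_between', which is a sub-title "
    "coming under a hierarchy of titles as ['Rule library']."
)


def whitespace_tokens(text):
    # Assumed default behaviour: plain split, punctuation stays attached.
    return text.split()


def word_tokens(text):
    # Alternative: lowercase and keep runs of letters, digits and underscores.
    return re.findall(r"\w+", text.lower())


rule = "expect_column_values_to_be_between"
ws_overlap = set(whitespace_tokens(query)) & set(whitespace_tokens(content))
word_overlap = set(word_tokens(query)) & set(word_tokens(content))

print(rule in ws_overlap)    # False: "...between?" vs "'...between',"
print(rule in word_overlap)  # True: punctuation is stripped on both sides
```

If the whitespace assumption holds, the trailing "?" in the question and the quotes and comma in the stored text keep the rule name from ever matching. A tokenizer like `word_tokens` could then be tried through the retriever's preprocessing hook (the `preprocess_func` argument, if the installed version exposes it).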
How should I modify the approach (other approaches are also welcome)? Thanks in advance.
Upvotes: 1
Views: 911