Bhavya

Reputation: 55

BM25 + PGVector dense retriever doesn't give the expected accuracy in hybrid search

I am building a prototype that fetches the relevant documents for an input question (it should search on both keywords and context). For this, I have vector embeddings (all-mpnet-base-v2) of the different documents stored in PGVector. I am using an ensemble retriever with BM25 as the keyword-based retriever and a PGVector similarity search as the context-based content retriever. Here is the code:

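# Import paths below match older LangChain releases; newer versions may
# expose the same classes under langchain_community instead.
from urllib.parse import quote_plus

import pandas as pd
from langchain.document_loaders import DataFrameLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.vectorstores.pgvector import PGVector

def hybrid_search(question):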
    embeddings = HuggingFaceEmbeddings(model_name="all-mpnet-base-v2")
    df_ = pd.read_csv("contents.csv", usecols=["enhancedContent"])
    loader_ = DataFrameLoader(df_, page_content_column='enhancedContent')
    docs = loader_.load()
    pages = loader_.load_and_split()

    bm25_retriever = BM25Retriever.from_documents(pages)
    bm25_retriever.k = 2  

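    collection_name = "dummy_db"
    # `config` is assumed to be a ConfigParser-style object defined elsewhere;
    # the stored connection string has a %s placeholder for the encoded password.
    CONNECTION_STRING = config.get("pg_vector_details", "CONNECTION_STRING") % quote_plus(
        config.get("pg_vector_details", "password"))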

    store1 = PGVector(
        collection_name=collection_name,
        connection_string=CONNECTION_STRING,
        embedding_function=embeddings, )

    retriever_pgvector = store1.as_retriever(
        search_kwargs={"k": 3}
    )

    ensemble_retriever = EnsembleRetriever(
        retrievers=[bm25_retriever, retriever_pgvector],
        weights=[0.4, 0.6],
    )

    context = ensemble_retriever.get_relevant_documents(question)
    print("Context from DB- Ensemble retriever: ", context)
    return context, ensemble_retriever
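
For context on what the two weights do: as far as I understand it, EnsembleRetriever merges the two ranked lists with weighted Reciprocal Rank Fusion rather than comparing raw scores. The sketch below is a standalone approximation, not the library code; the function name `weighted_rrf`, the constant `c=60`, and the document labels are assumptions made for illustration.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists of doc ids: each doc scores sum(weight / (rank + c))."""
    scores = defaultdict(float)
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results: BM25 returns k=2 docs, the dense retriever k=3
bm25_hits = ["doc_a", "doc_b"]
dense_hits = ["doc_c", "doc_a", "doc_d"]

fused = weighted_rrf([bm25_hits, dense_hits], weights=[0.4, 0.6])
print(fused)
```

If this reading is right, a document that neither retriever returns within its own k can never appear in the fused list, so the small k values (2 and 3) may matter as much as the weights.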

This does not return the expected content for the question "What is meant by expect_column_values_to_be_between?" even though a matching document is present in the DB. The relevant content stored in the DB is given below:

The content contains information about 'expect_column_values_to_be_between', which is a sub-title coming under a hierarchy of titles as ['Rule library']. The actual content starts from here: Description: Validates that entries in a specified column fall within a defined inclusive range, ensuring data adheres to expected bounds.
Dimension: Accuracy
Rule Level: Column
Mandatory Argument(s):
    1.Column name (Supported data types: Numeric )
    2.Enter lower bound (Supported data types: Numeric )
    3.Enter upper bound (Supported data types: Numeric )
Optional Argument(s):
    1.Value to be greater than lower bound ( When switched ON, rule succeeds only when value is strictly greater than specified lower bound value; when OFF, rule succeeds even when the value is greater than or equal to the specified lower bound value. By default, it is switched OFF )
    2.Value to be lesser than upper bound ( When switched ON, rule succeeds only when value is strictly less than specified upper bound value; when OFF, rule succeeds even when the value is lower than or equal to the specified upper bound value. By default, it is switched OFF )
    3.Tolerance level (%) ( Percentage of records that is expected to meet the required criteria, below which the rule fails. By default it is set to 100%, meaning all records are expected to meet specified criteria )
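
One thing that may be worth checking against this content is keyword tokenization. Assuming BM25Retriever's default preprocessing is a plain whitespace split (that is my understanding, but it is an assumption here), the rule name in the question carries a trailing `?` while the stored text wraps it in quotes followed by a comma, so the two tokens would not match exactly. The sketch below uses only the standard library; `whitespace_tokens` and `word_tokens` are hypothetical helper names.

```python
import re

def whitespace_tokens(text):
    # Mimics a naive split on whitespace; punctuation stays glued to words
    return text.split()

def word_tokens(text):
    # Lowercase and keep only runs of word characters (underscores included)
    return re.findall(r"\w+", text.lower())

question = "What is meant by expect_column_values_to_be_between?"
stored = ("The content contains information about "
          "'expect_column_values_to_be_between', which is a sub-title")
rule = "expect_column_values_to_be_between"

naive_overlap = set(whitespace_tokens(question)) & set(whitespace_tokens(stored))
clean_overlap = set(word_tokens(question)) & set(word_tokens(stored))

print("rule name matched with whitespace split:", rule in naive_overlap)
print("rule name matched with word tokens:", rule in clean_overlap)
```

If this turns out to be the cause, a tokenizer along the lines of `word_tokens` could be passed to the BM25 retriever, provided the installed LangChain version exposes a `preprocess_func` argument on `BM25Retriever.from_documents`.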

How should I modify the approach (other approaches are also welcome)? Thanks in advance.

Upvotes: 1

Views: 911

Answers (0)