Reputation: 23
I am relatively new to this field and would like guidance on how to effectively test an embedding model using a benchmark dataset. Specifically, I have acquired a few embedding models related to healthcare/medical topics from Hugging Face and wish to compare their performance.
Upon reviewing some literature, I noticed that authors evaluated their model on the PubMed benchmark dataset. Consequently, I decided to evaluate my models using this benchmark as well.
The authors provided details of the datasets used for evaluation: a PubMed QA subset, a PubMed subset, and a PubMed Summary subset. Each dataset is structured differently, pairing different text columns, and the Pearson correlation coefficient is used as the evaluation metric. They state:
The following datasets were used to evaluate model performance.
PubMed QA, Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
PubMed, Split: test, Pair: (title, text)
PubMed Summary, Subset: pubmed, Split: validation, Pair: (article, abstract)
Evaluation results are shown below. The Pearson correlation coefficient is used as the evaluation metric.
For instance, the PubMed QA dataset with the pqa_labeled subset has four columns: question, context, long_answer, and final_decision (yes/no/maybe). To evaluate my model, I computed embeddings for the questions and the long answers and took their cosine similarity. I then mapped final_decision to numeric values (no = 0, maybe = 0.5, yes = 1) and computed the Pearson correlation between the cosine similarities and the mapped decisions. However, the correlation I get (-0.016513776110764652) is nowhere near the reported value of 93.27, so something must be wrong with my approach.
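Here is a minimal sketch of what I did (the model name is just a placeholder for one of the Hugging Face models I downloaded):

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from scipy.stats import pearsonr
import numpy as np

# pqa_labeled subset of PubMedQA, as described in the paper
ds = load_dataset("pubmed_qa", "pqa_labeled", split="train")

model = SentenceTransformer("my-medical-embedding-model")  # placeholder

# Embed questions and long answers, then take the pairwise cosine similarity
q_emb = model.encode(ds["question"], normalize_embeddings=True)
a_emb = model.encode(ds["long_answer"], normalize_embeddings=True)
cos_sim = np.sum(q_emb * a_emb, axis=1)

# Map final_decision to numbers as my "ground truth"
label_map = {"no": 0.0, "maybe": 0.5, "yes": 1.0}
labels = np.array([label_map[d] for d in ds["final_decision"]])

r, _ = pearsonr(cos_sim, labels)
print(r)  # roughly -0.0165 for me, not the reported 93.27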
Similarly, for the PubMed Summary subset, which has the columns article, abstract, and section_names, I am not sure how to use it to assess a model at all. It is not clear to me what the ground truth is here, so I do not know how to compute a Pearson correlation coefficient in this context.
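For reference, this is how I load that dataset (my assumption is that the paper means the scientific_papers dataset with the pubmed config, since the column names match what I see):

from datasets import load_dataset

# Assumption: the "PubMed Summary" data is the scientific_papers dataset
# with the "pubmed" config; its columns are article, abstract, section_names.
summ = load_dataset("scientific_papers", "pubmed", split="validation")
print(summ.column_names)  # ['article', 'abstract', 'section_names']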
I would greatly appreciate any insights or guidance on how to correctly evaluate my embedding models using these benchmark datasets. Thank you.
Upvotes: 1
Views: 506