Reputation: 432
I am working on linking short texts to entities in a biomedical knowledge graph (UMLS CUIs) using SciSpacy for a research project. The goal is to analyze the relationship between the linked entity and a separate predefined entity.
My challenge is managing multiple possible entities identified in the texts, which introduces noise into the results. Although I use heuristics such as regex, a manual stop list, and filtering by semantic categories (TUIs) to clean the data, the issue persists due to the text complexity. I typically select the top ~3 entities per text based on the NER score, with a relatively high threshold.
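As a minimal sketch of the selection step described above (not the exact pipeline): filter candidates by score threshold and allowed TUIs, deduplicate by CUI, and keep the top few. The `Candidate` type, the `select_candidates` helper, the 0.85 threshold, and the placeholder CUIs and scores are all hypothetical and for illustration only. In SciSpacy the (CUI, score) pairs would typically come from `ent._.kb_ents`, and the TUIs from `linker.kb.cui_to_entity[cui].types`.

```python
from typing import FrozenSet, Iterable, List, NamedTuple


class Candidate(NamedTuple):
    cui: str               # UMLS concept identifier (placeholders below)
    score: float           # linker score for the mention
    tuis: FrozenSet[str]   # semantic type identifiers of the concept


def select_candidates(
    candidates: Iterable[Candidate],
    allowed_tuis: FrozenSet[str],
    min_score: float = 0.85,   # assumed threshold, tune per dataset
    top_k: int = 3,
) -> List[Candidate]:
    """Filter by score and TUI, keep the best hit per CUI, return top_k."""
    best = {}
    for cand in candidates:
        if cand.score < min_score:
            continue  # below the score threshold
        if not (cand.tuis & allowed_tuis):
            continue  # none of the concept's semantic types is allowed
        prev = best.get(cand.cui)
        if prev is None or cand.score > prev.score:
            best[cand.cui] = cand  # keep the highest score per CUI
    return sorted(best.values(), key=lambda c: c.score, reverse=True)[:top_k]


# Made-up candidates for "Standard PRS for Alzheimer's".
example_candidates = [
    Candidate("CUI_ALZ", 0.97, frozenset({"T047"})),
    Candidate("CUI_STANDARD", 0.91, frozenset({"T080"})),
    Candidate("CUI_PRS", 0.88, frozenset({"T170"})),
    Candidate("CUI_ALZ", 0.90, frozenset({"T047"})),
    Candidate("CUI_LOW", 0.60, frozenset({"T047"})),
]
selected = select_candidates(example_candidates, frozenset({"T047"}))
print([c.cui for c in selected])
```

With only the disease-like TUI allowed, the "Standard" and "PRS" candidates drop out here, but in practice noisy mentions often share an allowed TUI, which is where this kind of filter stops helping.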
For instance, the text "Standard PRS for Alzheimer's" incorrectly links entities for "Standard" and "PRS," in addition to "Alzheimer's." Another example, "Other diseases of respiratory system, NEC," captures "respiratory" and "diseases" but misses "NEC" (Necrotizing enterocolitis), which should be prioritized.
I've tried filtering results by semantic similarity using a biomedical model, but this approach is still imprecise and heavily dependent on the number of results. The linker often seems to prioritize entities appearing earlier in the text. I also use an abbreviation expander to handle non-standard acronym forms.
I think a smarter linker (not supported by SciSpacy) might help, or better matching at the sentence or whole-text level, but I don't know much about either. (I do some filtering of results using sentence transformers, but that's just cosine similarity, and I couldn't find a clear cutoff that generalized well.)
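For concreteness, the cosine-similarity filter amounts to something like the sketch below. It uses toy 2-D vectors in place of real sentence-transformer embeddings, and the `filter_by_similarity` helper and the 0.5 cutoff are hypothetical; the cutoff is exactly the part that is hard to choose.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_by_similarity(
    text_vec: Sequence[float],
    candidate_vecs: Dict[str, Sequence[float]],
    cutoff: float,
) -> List[Tuple[str, float]]:
    """Keep candidates whose similarity to the text is at least `cutoff`."""
    scored = [(name, cosine(text_vec, vec)) for name, vec in candidate_vecs.items()]
    kept = [(name, sim) for name, sim in scored if sim >= cutoff]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for embeddings of the text and candidate names.
text_vec = [1.0, 0.0]
candidate_vecs = {
    "Alzheimer's disease": [0.9, 0.1],
    "Standard": [0.1, 0.9],
}
kept = filter_by_similarity(text_vec, candidate_vecs, cutoff=0.5)
print([name for name, _ in kept])
```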
I do not have the resources or time to learn how to fine-tune a new linker model and prepare data for it (this is just a sub-component of my overall PhD).
I'm looking for advice on more effective strategies for entity linking at the sentence or whole-text level that do not require fine-tuning a new model. Compatibility with SciSpacy is important, since linkage to the UMLS ontology (for the KG CUI entities) is a must.
Upvotes: 2
Views: 31
Reputation: 1200
Here you will find some ideas:
Upvotes: 0