Reputation: 79
POS tagging for PROPN does not work as expected with the en_core_web_lg model.
POS tagging works more predictably using the _md model.
Given the (poorly-formed) sentence: "CK7, CK-20, GATA 3, PSA, are all negative."
When using the _lg model, "CK7" is tagged as a NOUN(NNS).
When using the _md model, "CK7" is tagged as a PROPN(NNP). This is correct.
When using the _lg model and replacing "CK7" in the sentence with each of the following:
"CK1" tagged as PROPN
"CK2" tagged as PROPN
"CK3", "CK4" tagged as PROPN
"CK5" tagged as ADJ
"CK6" tagged as PROPN
"CK7" tagged as NOUN
"CK8" tagged as PROPN
"CK9" tagged as ADP
"CK22", "CK222" tagged as PROPN
When using the _md model, and replacing "CK7" as described above, all were tagged PROPN, as expected.
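For anyone who wants to reproduce the comparison above, here is a minimal sketch. The helper names (`tag_term`, `tag_terms`, `compare_models`) are made up for illustration, and it assumes both `en_core_web_md` and `en_core_web_lg` are installed. The tags you get may differ from the ones listed above depending on your spaCy and model versions.

```python
# Sentence template from the question; {term} is swapped for each variant.
SENTENCE = "{term}, CK-20, GATA 3, PSA, are all negative."
TERMS = ["CK1", "CK2", "CK3", "CK4", "CK5", "CK6",
         "CK7", "CK8", "CK9", "CK22", "CK222"]


def tag_term(nlp, term, template=SENTENCE):
    """Return (pos_, tag_) for the token whose text equals `term`, or None.

    `nlp` is any callable that returns an iterable of tokens exposing
    .text, .pos_ and .tag_ (a loaded spaCy pipeline does).
    """
    doc = nlp(template.format(term=term))
    for token in doc:
        if token.text == term:
            return token.pos_, token.tag_
    return None  # the tokenizer split the term, so there is no exact match


def tag_terms(nlp, terms=TERMS):
    """Map each term to the (pos_, tag_) pair the pipeline assigns to it."""
    return {term: tag_term(nlp, term) for term in terms}


def compare_models(model_names=("en_core_web_md", "en_core_web_lg")):
    """Print the tags per model. Requires spaCy and both models installed."""
    import spacy  # imported lazily so the helpers above work without spaCy

    for name in model_names:
        nlp = spacy.load(name)
        print(name)
        for term, tags in tag_terms(nlp).items():
            print(f"  {term}: {tags}")
```

Calling `compare_models()` prints one block per model, which makes it easy to see which variants the two models disagree on.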
As most of the sentences I will be analyzing will be poorly formed, I thought that the _lg model's 'deeper' dependency parsing would serve better, only to find the above issues with POS tagging.
Please advise on:
Thank you very much.
Upvotes: 2
Views: 475
Reputation: 11494
So this is not a direct answer to your question, but if you are working with biomedical data it might make sense to try out this package: scispacy
It doesn't tag CK-7 as a proper noun, but it can handle many of these kinds of terms as entities; see the various additional NER models that support different tagsets. It's still under development, and you might still need to add special cases or exceptions for your data, but I think you will see better and more consistent results than with the standard spaCy models.
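As a rough illustration of what a special case could look like, here is a sketch that forces marker-like tokens to PROPN. It assumes spaCy v3, where pipelines have an `attribute_ruler` component; the regex and the function names (`add_marker_rule`, `demo`) are my own assumptions, not something from scispacy or from the question. Note that in the standard pipelines the attribute ruler runs after the parser, so this only changes the reported tags and would not be expected to change the dependency parse.

```python
# Hypothetical pattern: "CK", an optional hyphen, then digits (CK7, CK-20).
MARKER_REGEX = r"^CK-?\d+$"


def add_marker_rule(nlp, regex=MARKER_REGEX):
    """Force tokens whose text matches `regex` to TAG=NNP / POS=PROPN."""
    # Reuse the pipeline's attribute ruler if present, otherwise add one.
    if "attribute_ruler" in nlp.pipe_names:
        ruler = nlp.get_pipe("attribute_ruler")
    else:
        ruler = nlp.add_pipe("attribute_ruler")
    # One token pattern; it only fires if the tokenizer keeps the term whole.
    ruler.add(
        patterns=[[{"TEXT": {"REGEX": regex}}]],
        attrs={"TAG": "NNP", "POS": "PROPN"},
    )
    return nlp


def demo(model_name="en_core_web_lg"):
    """Print the tags after adding the rule. Requires spaCy and the model."""
    import spacy  # imported lazily so add_marker_rule can be used on its own

    nlp = add_marker_rule(spacy.load(model_name))
    doc = nlp("CK7, CK-20, GATA 3, PSA, are all negative.")
    for token in doc:
        print(token.text, token.pos_, token.tag_)
```

Terms with a space, such as "GATA 3", are split into two tokens, so they would need a two-token pattern rather than this single-token regex.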
Upvotes: 2