POS tagging not consistent using spaCy en_core_web_lg model

Question

POS tagging of proper nouns (PROPN) does not behave as I expect with the en_core_web_lg (_lg) model.
The en_core_web_md (_md) model tags the same tokens more predictably.

Given the (poorly-formed) sentence: "CK7, CK-20, GATA 3, PSA, are all negative."

When using the _lg model, "CK7" is tagged as NOUN (NNS).

When using the _md model, "CK7" is tagged as PROPN (NNP), which is the tag I expect.

When using the _lg model and replacing "CK7" in the sentence with each of the following, I get:

"CK1" tagged as PROPN
"CK2" tagged as PROPN
"CK3", "CK4" tagged as PROPN
"CK5" tagged as ADJ
"CK6" tagged as PROPN
"CK7" tagged as NOUN
"CK8" tagged as PROPN
"CK9" tagged as ADP
"CK22", "CK222" tagged as PROPN

When using the _md model with the same replacements, every variant is tagged PROPN, as expected.
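
The comparison can be reproduced with something along these lines. This is only a minimal sketch: the helper names (tag_of, compare_models, run_with_spacy) are made up for illustration, it assumes spaCy is installed and both en_core_web_md and en_core_web_lg have been downloaded, and it assumes the tokenizer keeps each "CKn" as a single token. The tags it prints may differ between spaCy and model versions.

```python
TEMPLATE = "{}, CK-20, GATA 3, PSA, are all negative."
VARIANTS = ["CK1", "CK2", "CK3", "CK4", "CK5", "CK6",
            "CK7", "CK8", "CK9", "CK22", "CK222"]


def tag_of(nlp, sentence, target):
    """Return (coarse POS, fine-grained tag) of the first token equal to target.

    nlp is any callable that yields tokens with .text, .pos_ and .tag_
    (a loaded spaCy pipeline fits). Returns None if no token matches,
    e.g. because the tokenizer split the target into several tokens.
    """
    for token in nlp(sentence):
        if token.text == target:
            return token.pos_, token.tag_
    return None


def compare_models(models, template, variants):
    """Map each variant to {model_name: (pos, tag)} for side-by-side comparison."""
    return {
        variant: {
            name: tag_of(nlp, template.format(variant), variant)
            for name, nlp in models.items()
        }
        for variant in variants
    }


def run_with_spacy():
    """Load both models and print the tags for every variant (hypothetical driver)."""
    import spacy  # imported here so the helpers above do not depend on spaCy

    models = {name: spacy.load(name)
              for name in ("en_core_web_md", "en_core_web_lg")}
    for variant, tags in compare_models(models, TEMPLATE, VARIANTS).items():
        print(variant, tags)
```

Calling run_with_spacy() prints one line per variant with the (POS, tag) pair from each model, which makes the _md / _lg differences easy to see side by side.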

Most of the sentences I will be analyzing are poorly formed, so I expected the _lg model's 'deeper' dependency parsing to serve better, only to run into the POS tagging issues above.

Please advise on:

How should I deal with the counter-intuitive POS tagging when using the en_core_web_lg model?
Which model is best for dependency parsing of poorly formed sentences?

Thank you very much.
