Zippy242

Reputation: 79

POS tagging not consistent using spaCy en_core_web_lg model

Given the (poorly-formed) sentence: "CK7, CK-20, GATA 3, PSA, are all negative."

When using the _lg model, "CK7" is tagged as a NOUN(NNS).

When using the _md model, "CK7" is tagged as a PROPN(NNP). This is correct.

When using the _lg model, and replacing "CK7" in the sentence with:

When using the _md model, and replacing "CK7" as described above, all were tagged PROPN, as expected.
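A minimal sketch of how the comparison above could be reproduced side by side. The helper names (`pos_table`, `diff_tags`, `compare_models`) are hypothetical and not part of spaCy; the sketch assumes both models are installed and that they tokenize the sentence identically, which is why the rows are paired with `zip`.

```python
def pos_table(nlp, text):
    """Return (token text, coarse POS, fine-grained tag) for each token."""
    return [(tok.text, tok.pos_, tok.tag_) for tok in nlp(text)]


def diff_tags(rows_a, rows_b):
    """Return the row pairs on which two pipelines disagree.

    Assumes both pipelines produced the same tokenization, so rows
    can be paired positionally.
    """
    return [(a, b) for a, b in zip(rows_a, rows_b) if a != b]


def compare_models(text, model_a="en_core_web_lg", model_b="en_core_web_md"):
    """Load two installed spaCy models and list their tagging differences."""
    import spacy  # imported lazily so the helpers above work without spaCy

    rows_a = pos_table(spacy.load(model_a), text)
    rows_b = pos_table(spacy.load(model_b), text)
    return diff_tags(rows_a, rows_b)
```

Calling `compare_models("CK7, CK-20, GATA 3, PSA, are all negative.")` returns only the tokens on which the two models differ, which makes it easier to see whether the disagreement is limited to "CK7" or affects the other marker names as well.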

As most of the sentences I will be analyzing will be poorly formed, I thought that the _lg model's 'deeper' dependency parsing would serve better, only to find the above issues with POS tagging.

Please advise on:

  1. How to deal with the counter-intuitive POS tagging when using the en_core_web_lg model?
  2. Which model is best for dependency parsing poorly-formed sentences?

Thank you very much.

Upvotes: 2

Views: 475

Answers (1)

aab

Reputation: 11494

So this is not a direct answer to your question, but if you are working with biomedical data it might make sense to try out this package: scispacy

It doesn't tag CK-7 as a proper noun, but it can handle lots of these kinds of terms as entities; see the various additional NER models that support different tagsets. It's still under development, and you might still need to add special cases/exceptions for your data, but I think you will see better and more consistent results than with the standard spaCy models.
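One possible way to add such special cases is sketched below. It assumes spaCy v3, where the pretrained English pipelines include an `attribute_ruler` component; older versions handle exceptions differently. The helper names (`build_overrides`, `apply_overrides`) and the default tag values are illustrative assumptions, not something this answer or the spaCy docs prescribe.

```python
def build_overrides(terms, tag="NNP", pos="PROPN"):
    """Build one attribute_ruler entry per term, matched on exact token text."""
    return [
        {"patterns": [[{"ORTH": term}]], "attrs": {"TAG": tag, "POS": pos}}
        for term in terms
    ]


def apply_overrides(nlp, terms):
    """Register the overrides on a loaded pipeline (assumes spaCy v3).

    The pipeline must already contain an "attribute_ruler" component,
    as the pretrained en_core_web_* v3 pipelines do.
    """
    ruler = nlp.get_pipe("attribute_ruler")
    ruler.add_patterns(build_overrides(terms))
    return nlp
```

For example, `apply_overrides(spacy.load("en_core_web_lg"), ["CK7", "PSA"])` would force those exact single-token strings to PROPN/NNP regardless of what the statistical tagger predicts. Terms that the tokenizer splits into several tokens would need multi-token patterns instead.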

Upvotes: 2
