Milos Cuculovic

Reputation: 20223

W030 Some entities could not be aligned in the text - but why?

Please help me understand the following spaCy example of non-aligned entities:

"text" : "15) Abstract:“The contribution of surface charge was been quantitative determined” -> Correct the grammar."

"labels" : [[13, 82, "LOCATION"], [4, 12, "LOCATION"], [86, 105, "ACTION"]]}

To me it all looks good: the entities appear to be well aligned. Any idea why I am getting the following warning?

[W030] Some entities could not be aligned in the text

If I add a space between the colon and the opening double quote after "Abstract" (Abstract:“The) and adjust the entity offsets accordingly, I get:

"text" : "15) Abstract: “The contribution of surface charge was been quantitative determined” -> Correct the grammar."

"labels" : [[14, 82, "LOCATION"], [4, 12, "LOCATION"], [87, 106, "ACTION"]]}

Then everything looks OK. I would like to understand why there is such a difference.

EDIT:

Here is the code I am trying to use to get rid of this issue. It works with infixes.extend((":")), but why doesn't it work with infixes.extend((":", "“", ",", '“', "/", ";", ".", '”'))?

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("ner")
# Start from the default infix patterns and add extra ones
infixes = list(nlp.Defaults.infixes)
#infixes.extend((":", "“", ",", '“', "/", ";", ".", '”'))
infixes.extend((":"))
# Rebuild the infix regex and install it on the tokenizer
infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer

Upvotes: 1

Views: 1255

Answers (1)

polm23

Reputation: 15623

Basically your tokenization is weird. You can check tokenization like this:

print(nlp.tokenizer.explain(text))

In this case you'll see that Abstract:“The is a single token. That's pretty weird and I'm not entirely sure why it's happening, but that is the source of your problem.
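
To make the connection to the warning concrete, here is a minimal pure-Python sketch of the alignment idea: an entity span can only be aligned if its start and end offsets coincide with token boundaries. The helper names (token_offsets, misaligned) are hypothetical, and the token lists are hardcoded assumptions modeled on the observation above that Abstract:“The is one token. They are not output captured from spaCy.

```python
def token_offsets(text, tokens):
    """Return (start, end) character offsets for each token, in order."""
    offsets, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)  # locate the token at or after the previous one
        end = start + len(tok)
        offsets.append((start, end))
        pos = end
    return offsets


def misaligned(text, tokens, labels):
    """Return the labels whose start or end is not on a token boundary."""
    offsets = token_offsets(text, tokens)
    starts = {s for s, _ in offsets}
    ends = {e for _, e in offsets}
    return [lab for lab in labels if lab[0] not in starts or lab[1] not in ends]


# Tokens shared by both variants of the sentence (assumed tokenization).
tail = ["contribution", "of", "surface", "charge", "was", "been",
        "quantitative", "determined", "”", "->", "Correct", "the", "grammar", "."]

# Original text: "Abstract:“The" is assumed to stay a single token.
text = "15) Abstract:“The contribution of surface charge was been quantitative determined” -> Correct the grammar."
labels = [[13, 82, "LOCATION"], [4, 12, "LOCATION"], [86, 105, "ACTION"]]
tokens = ["15", ")", "Abstract:“The"] + tail
bad = misaligned(text, tokens, labels)
print(bad)

# Text with the extra space: the token is assumed to split at ":" and "“".
fixed_text = "15) Abstract: “The contribution of surface charge was been quantitative determined” -> Correct the grammar."
fixed_labels = [[14, 82, "LOCATION"], [4, 12, "LOCATION"], [87, 106, "ACTION"]]
fixed_tokens = ["15", ")", "Abstract", ":", "“", "The"] + tail
fixed_bad = misaligned(fixed_text, fixed_tokens, fixed_labels)
print(fixed_bad)
```

Under these assumed tokenizations, both LOCATION spans in the original text begin or end inside the Abstract:“The token, while none of the spans in the spaced version do. In spaCy itself, a similar per-span check is doc.char_span(start, end), which returns None under the default strict alignment mode when the offsets do not match token boundaries.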

Upvotes: 2
