Reputation: 20223
Please help me understand the following spaCy example of non-aligned entities:
"text" : "15) Abstract:“The contribution of surface charge was been quantitative determined” -> Correct the grammar."
"labels" : [[13, 82, "LOCATION"], [4, 12, "LOCATION"], [86, 105, "ACTION"]]}
To me it all looks good; the entities seem well aligned. Any idea why I am getting the following warning?
[W030] Some entities could not be aligned in the text
If I add a space between the colon and the opening double quote after Abstract (that is, in Abstract:“The)
and change the entity offsets accordingly, in order to have:
"text" : "15) Abstract: “The contribution of surface charge was been quantitative determined” -> Correct the grammar."
"labels" : [[14, 82, "LOCATION"], [4, 12, "LOCATION"], [87, 106, "ACTION"]]}
Then everything looks OK. I would like to understand why there is such a difference.
EDIT:
Here is the code I am using to try to get rid of this issue. It works with infixes.extend((":")), but why doesn't it work with infixes.extend((":", "“", ",", '“', "/", ";", ".", '”'))?
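import spacy

nlp = spacy.blank("en")
nlp.add_pipe("ner")
# Start from the default infix patterns and append the extra ones.
infixes = list(nlp.Defaults.infixes)
#infixes.extend((":", "“", ",", '“', "/", ";", ".", '”'))
# Note: (":") is just the string ":", so extend() adds its single character.
infixes.extend((":"))
# Compile the combined patterns and install them on the tokenizer.
infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer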
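One thing worth checking about the commented-out line: as far as I can tell, compile_infix_regex joins the entries with | and compiles the result as a single regular expression, so a bare "." is a wildcard rather than a literal full stop. The sketch below only mimics that joining step with the standard re module; it does not call spaCy, and the names extra and sample are illustrative. It may not be the whole explanation, but it shows how differently the escaped and unescaped patterns behave.

```python
import re

# Entries from the commented-out extend() call (duplicates removed).
extra = [":", "“", ",", "/", ";", ".", "”"]

# Assumption: this mirrors how the infix entries are combined into one regex.
raw = re.compile("|".join(extra))
# Escaping each entry makes "." match only a literal full stop.
escaped = re.compile("|".join(re.escape(e) for e in extra))

sample = "Abstract:“The"
raw_hits = [m.group() for m in raw.finditer(sample)]
escaped_hits = [m.group() for m in escaped.finditer(sample)]

print(raw_hits)      # every character matches, because "." is a wildcard
print(escaped_hits)  # only the colon and the opening quote match
```

If the goal is to split only on these punctuation marks, passing escaped entries (for example via re.escape) is probably closer to the intent.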
Upvotes: 1
Views: 1255
Reputation: 15623
The problem is your tokenization. You can inspect how the text is split like this:
print(nlp.tokenizer.explain(text))
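In this case you'll see that Abstract:“The is a single token. That's unexpected, and I'm not entirely sure why it happens, but it is the source of your problem: the [4, 12] entity ends and the [13, 82] entity starts inside that token, so neither offset falls on a token boundary and the spans cannot be aligned.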
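To make the alignment rule concrete, here is a minimal sketch in plain Python, without spaCy. The helper names (token_offsets, misaligned) are hypothetical, and the token lists are assumptions: the first follows the single-token observation above, the second is a guess at how the spaced version is split. In practice you would take the tokens from [t.text for t in nlp.make_doc(text)].

```python
def token_offsets(text, tokens):
    """Return (start, end) character offsets for each token, in order."""
    offsets, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)  # find the token at or after pos
        end = start + len(tok)
        offsets.append((start, end))
        pos = end
    return offsets


def misaligned(text, tokens, labels):
    """Return the labels whose start or end is not on a token boundary."""
    offsets = token_offsets(text, tokens)
    starts = {s for s, _ in offsets}
    ends = {e for _, e in offsets}
    return [lab for lab in labels if lab[0] not in starts or lab[1] not in ends]


# Assumed tokenizations of the start of each text (not produced by spaCy here).
before = "15) Abstract:“The"
before_tokens = ["15", ")", "Abstract:“The"]
after = "15) Abstract: “The"
after_tokens = ["15", ")", "Abstract", ":", "“", "The"]

labels = [(4, 12, "LOCATION")]
print(misaligned(before, before_tokens, labels))  # offset 12 is inside a token
print(misaligned(after, after_tokens, labels))    # offset 12 ends "Abstract"
```

Under these assumptions the same [4, 12] span fails on the original text and passes once the space separates the colon from the quote, which matches the behaviour described in the question.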
Upvotes: 2