Reputation: 433
I am running spaCy on a paragraph of text, and it doesn't extract the quoted titles the same way each time. I don't understand why.
import spacy
nlp = spacy.load("en_core_web_lg")
doc = nlp("""A seasoned TV exec, Greenblatt spent eight years as chairman of NBC Entertainment before WarnerMedia. He helped revive the broadcast network's primetime lineup with shows like "The Voice," "This Is Us," and "The Good Place," and pushed the channel to the top of the broadcast-rating ranks with 18-49-year-olds, Variety reported. He also drove Showtime's move into original programming, with series like "Dexter," "Weeds," and "Californication." And he was a key programming exec at Fox Broadcasting in the 1990s.""")
Here's the whole output:
A
seasoned
TV
exec
,
Greenblatt
spent
eight years
as
chairman
of
NBC Entertainment
before
WarnerMedia
.
He
helped
revive
the
broadcast
network
's
primetime
lineup
with
shows
like
"
The Voice
,
"
"
This
Is
Us
,
"
and
"The Good Place
,
"
and
pushed
the
channel
to
the
top
of
the
broadcast
-
rating
ranks
with
18-49-year-olds
,
Variety
reported
.
He
also
drove
Showtime
's
move
into
original
programming
,
with
series
like
"
Dexter
,
"
"
Weeds
,
"
and
"
Californication
.
"
And
he
was
a
key
programming
exec
at
Fox Broadcasting
in
the 1990s
.
The one that bothers me the most is The Good Place, which is extracted as "The Good Place. Since the quotation mark is part of the token, I can't extract the quoted text with a token Matcher later on. Any idea what's going on here?
Upvotes: 0
Views: 699
Reputation: 11474
The issue isn't the tokenization (which should always split " off in this case), but the NER, which uses a statistical model and doesn't always make 100% perfect predictions.
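One way to check the tokenizer on its own is to run a pipeline that has no NER component at all. The snippet below is a minimal sketch that assumes `spacy.blank("en")` (rule-based tokenizer only) and a shortened copy of the sentence from the question. It is not the asker's actual pipeline.

```python
import spacy

# A blank English pipeline contains only the rule-based tokenizer:
# no tagger, no parser, no NER, and no merge_entities.
nlp = spacy.blank("en")

# Shortened excerpt of the text from the question.
text = 'shows like "The Voice," "This Is Us," and "The Good Place," and pushed the channel'

tokens = [token.text for token in nlp(text)]
print(tokens)
```

If every quotation mark shows up as its own token here, the tokenizer is behaving as expected and the odd token in the question comes from a later pipeline step.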
I don't think you've shown all your code here, but from the output, I would assume you've merged entities by adding merge_entities to the pipeline. These are the resulting tokens after entities are merged, and if an entity wasn't predicted correctly, you'll get slightly incorrect tokens.
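To illustrate how a slightly wrong entity boundary turns into a wrong token after merging, here is a rough sketch. It assumes a blank pipeline and a hand-made entity span that deliberately includes the opening quotation mark, standing in for an imperfect NER prediction. The `WORK_OF_ART` label and the span boundaries are chosen for illustration and are not real model output. The merge loop uses `doc.retokenize()`, which is roughly what merge_entities does.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; no trained NER involved
text = 'shows like "The Voice," "This Is Us," and "The Good Place," and pushed'
doc = nlp(text)

# Hand-made entity standing in for an imperfect NER prediction:
# it wrongly includes the opening quotation mark.
bad = '"The Good Place'
start = text.index(bad)
span = doc.char_span(start, start + len(bad), label="WORK_OF_ART")
doc.ents = [span]

# Merge each entity into a single token, roughly what merge_entities does.
with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(ent)

merged_tokens = [token.text for token in doc]
print(merged_tokens)
```

The other two titles keep their quotation marks as separate tokens because no entity was set over them, which mirrors the inconsistency in the question's output.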
I tried the most recent en_core_web_lg and couldn't replicate these NER results, but the models for each version of spaCy give slightly different results. If you haven't already, try v2.2, which uses some data augmentation techniques to improve the handling of quotes.
Upvotes: 1