Googlebot

Reputation: 15683

How to get token indices for noun phrases in spaCy?

I get the tokens and noun phrases with

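import spacy

# Assumption: any English pipeline with a dependency parser works here
nlp = spacy.load("en_core_web_sm")

text = "This is commonly referred to as global warming or climate change."
doc = nlp(text)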

for token in doc:
    print(token.i, token.text)

print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])

and the result is

0 This
1 is
2 commonly
3 referred
4 to
5 as
6 global
7 warming
8 or
9 climate
10 change
11 .
Noun phrases: ['global warming', 'climate change']

Is it possible to get the token indices for the noun phrases instead of the words? For example:

Noun phrases: ['6,7', '9,10']

Upvotes: 2

Views: 147

Answers (1)

Wiktor Stribiżew

Reputation: 627545

You may use the Span's start and end properties:

start   int     The index of the first token of the span.
end     int     The index of the first token after the span.

So, use

print("Noun phrases:", [(chunk.start,chunk.end-1) for chunk in doc.noun_chunks])
# => Noun phrases: [(6, 7), (9, 10)]

Or, if you need comma-separated string items,

print("Noun phrases:", ["{},{}".format(chunk.start, chunk.end - 1) for chunk in doc.noun_chunks])
# => Noun phrases: ['6,7', '9,10']
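
Note that (start, end - 1) only gives the first and last token of each chunk. If a noun phrase spans more than two tokens and you need every index, you can iterate over the span instead. Here is a minimal sketch of that idea. It assumes a blank English pipeline (tokenizer only) and slices the span by hand, since noun_chunks needs a parser; the names first_last and all_indices are just for illustration.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only), so no trained
# model has to be downloaded. doc.noun_chunks needs a parser, so the
# span is sliced by hand here instead.
nlp = spacy.blank("en")
doc = nlp("This is commonly referred to as global warming or climate change.")

# Same tokens as the "global warming" chunk in the question
span = doc[6:8]

# Span.end is exclusive, hence the "- 1" for the last token's index
first_last = (span.start, span.end - 1)

# Every token index in the span, not just the first and last
all_indices = [token.i for token in span]

print(span.text, first_last, all_indices)
```

With a full pipeline, the same list comprehension should work on each chunk in doc.noun_chunks, because every chunk is a Span.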

Upvotes: 2
