Reputation: 1532
I have this simple example of chunking in nltk.
My data:
data = 'The little yellow dog will then walk to the Starbucks, where he will introduce them to Michael.'
...pre-processing ...
import nltk
data_tok = nltk.word_tokenize(data)  # tokenisation
data_pos = nltk.pos_tag(data_tok)  # POS tagging
CHUNKING:
cfg_1 = "CUSTOMCHUNK: {<VB><.*>*?<NNP>}" #should return `walk to the Starbucks`, etc.
chunker = nltk.RegexpParser(cfg_1)
data_chunked = chunker.parse(data_pos)
This returns (among other stuff): (CUSTOMCHUNK walk/VB to/TO the/DT Starbucks/NNP)
, so it did what I wanted it to do.
Now my question: I want to switch to spaCy for my projects. How would I do this in spaCy?
I've got as far as tagging the tokens (the coarse .pos_ attribute will do for me):
from spacy.en import English
parser = English()
parsed_sent = parser(u'The little yellow dog will then walk to the Starbucks, where')
def print_coarse_pos(token):
    print(token, token.pos_)

for sentence in parsed_sent.sents:
    for token in sentence:
        print_coarse_pos(token)
...which prints the tokens and their tags:
The DET
little ADJ
yellow ADJ
dog NOUN
will VERB
then ADV
walk VERB
...
How could I extract chunks with my own grammar?
Upvotes: 10
Views: 10191
Reputation: 1497
Copied verbatim from https://github.com/spacy-io/spaCy/issues/342
There are a few ways to go about this. The closest functionality to that RegexpParser class is spaCy's Matcher. But for syntactic chunking, I would typically use the dependency parse. For instance, for NP chunking you have the doc.noun_chunks iterator:
doc = nlp(text)
for np in doc.noun_chunks:
    print(np.text)
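With the sentence from the question, this should print something close to "The little yellow dog", "the Starbucks", "he", "them" and "Michael", though the exact spans depend on the parser model.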
The basic way that this works is something like this:
for token in doc:
    if is_head_of_chunk(token):
        chunk_start = token.left_edge.i
        chunk_end = token.right_edge.i + 1
        yield doc[chunk_start : chunk_end]
You can define the hypothetical is_head_of_chunk function however you like. You can play around with the dependency parse visualizer to see the syntactic annotation scheme and figure out what labels to use: http://spacy.io/demos/displacy
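For the custom grammar itself, here is a minimal sketch of how the Matcher mentioned above could approximate the original <VB><.*>*?<NNP> rule. It assumes a recent spaCy (v3) with the en_core_web_sm model downloaded, and unlike the non-greedy NLTK pattern it returns every matching span, so you may want to filter overlapping matches:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# a verb, followed by any number of tokens, followed by a proper noun
pattern = [{"POS": "VERB"}, {"OP": "*"}, {"POS": "PROPN"}]
matcher.add("CUSTOMCHUNK", [pattern])

doc = nlp("The little yellow dog will then walk to the Starbucks, "
          "where he will introduce them to Michael.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)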
Upvotes: 4