ben_aaron

Reputation: 1532

Chunking with rule-based grammar in spacy

I have this simple example of chunking in nltk.

My data:

data = 'The little yellow dog will then walk to the Starbucks, where he will introduce them to Michael.'

...pre-processing ...

data_tok = nltk.word_tokenize(data) #tokenisation
data_pos = nltk.pos_tag(data_tok) #POS tagging

CHUNKING:

cfg_1 = "CUSTOMCHUNK: {<VB><.*>*?<NNP>}" #should return `walk to the Starbucks`, etc.
chunker = nltk.RegexpParser(cfg_1)
data_chunked = chunker.parse(data_pos)

This returns (among other stuff): (CUSTOMCHUNK walk/VB to/TO the/DT Starbucks/NNP), so it did what I wanted it to do.

Now my question: I want to switch to spacy for my projects. How would I do this in spacy?

I have gotten as far as tagging it (the coarser .pos_ attribute will do for me):

from spacy.en import English    
parser = English()
parsed_sent = parser(u'The little yellow dog will then walk to the Starbucks, where')

def print_coarse_pos(token):
  print(token, token.pos_)

for sentence in parsed_sent.sents:
  for token in sentence:
    print_coarse_pos(token)

... which prints each token with its coarse tag: The DET, little ADJ, yellow ADJ, dog NOUN, will VERB, then ADV, walk VERB ...

How could I extract chunks with my own grammar?

Upvotes: 10

Views: 10191

Answers (1)

fenceop

Reputation: 1497

Copied verbatim from https://github.com/spacy-io/spaCy/issues/342

There are a few ways to go about this. The closest functionality to that RegexpParser class is spaCy's Matcher. But for syntactic chunking, I would typically use the dependency parse. For instance, for NP chunking you have the doc.noun_chunks iterator:

doc = nlp(text)
for np in doc.noun_chunks:
    print(np.text)

The basic way that this works is something like this:

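def iter_chunks(doc):
    for token in doc:
        # is_head_of_chunk is a hypothetical predicate you write yourself
        if is_head_of_chunk(token):
            # the chunk spans the token's whole subtree in the parse
            chunk_start = token.left_edge.i
            chunk_end = token.right_edge.i + 1
            yield doc[chunk_start:chunk_end]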

You can define the hypothetical is_head_of_chunk function however you like. You can play around with the dependency parse visualizer to see the syntactic annotation scheme, and figure out what labels to use: http://spacy.io/demos/displacy
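
If what you want is really the tag-pattern behaviour of the NLTK grammar rather than a syntactic chunk, one option is to run a regular expression over the tag sequence yourself. The sketch below is plain Python and is not spaCy's Matcher. It assumes you have (text, tag) pairs, which you could build from a parsed doc as [(t.text, t.tag_) for t in doc]. The names chunk_by_tags and CUSTOMCHUNK are made up for illustration, and the tags in the sample are written by hand in Penn Treebank style, not taken from a real tagger run.

```python
import re

# Regex counterpart of the NLTK rule {<VB><.*>*?<NNP>}:
# a VB, then as few tags as possible, then an NNP.
CUSTOMCHUNK = r"<VB>(?:<[^>]+>)*?<NNP>"

def chunk_by_tags(tagged, pattern):
    """Yield slices of `tagged` whose tag sequence matches `pattern`."""
    # Encode the tags as one string such as "<DT><JJ><NN>" and remember
    # which character offset each token starts at.
    encoded = ""
    offset_to_index = {}
    for i, (_, tag) in enumerate(tagged):
        offset_to_index[len(encoded)] = i
        encoded += "<%s>" % tag
    offset_to_index[len(encoded)] = len(tagged)

    # Map each regex match back from character offsets to token indices.
    for match in re.finditer(pattern, encoded):
        start = offset_to_index[match.start()]
        end = offset_to_index[match.end()]
        yield tagged[start:end]

# Hand-tagged sample (assumed tags, for illustration only).
sample = [
    ("The", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("will", "MD"), ("then", "RB"), ("walk", "VB"), ("to", "TO"),
    ("the", "DT"), ("Starbucks", "NNP"), (",", ","), ("where", "WRB"),
    ("he", "PRP"), ("will", "MD"), ("introduce", "VB"), ("them", "PRP"),
    ("to", "TO"), ("Michael", "NNP"), (".", "."),
]

for chunk in chunk_by_tags(sample, CUSTOMCHUNK):
    print(" ".join(text for text, _ in chunk))
# prints:
# walk to the Starbucks
# introduce them to Michael
```

Matches are found left to right and do not overlap, so each VB is paired with the nearest following NNP. With spaCy you would get token indices back, so you could slice the doc itself instead of a list of pairs.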

Upvotes: 4
