user9845038

Reputation: 41

How to use words directly in nltk grammar

I am trying to extract a company name from a given German address using Python NLTK. This is the code I used:

import nltk

address="CompanyName GmbH * Keltenstr. 16 * 123456 Kippenheim * Deutschland"
tokens = nltk.word_tokenize(address)
textTokens = nltk.Text(tokens)
POStagList = nltk.pos_tag(textTokens)
print(POStagList)

grammar = """
        NP: 
            {<NN.?|JJ|FW>GmbH}"""


cp = nltk.RegexpParser(grammar)
result = cp.parse(POStagList)

for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):
   print("NP Subtree:", subtree)

I need the output: CompanyName GmbH

Sometimes, instead of GmbH, the suffix may be corp, Inc., llc, etc.

How can I solve this?

How can I use string values and escape-sequence characters directly inside the grammar?

Upvotes: 1

Views: 245

Answers (1)

Wiktor Stribiżew

Reputation: 627082

Instead of mixing grammar with literal strings, you may use a regex-based workaround: tag the tokens with POS, join each word with its tag into a single string, and then grab only the tokens that appear before a known word (like GmbH).

The code will look like this:

import nltk
import re
address="CompanyName GmbH * Keltenstr. 16 * 123456 Kippenheim * Deutschland"
tokens = nltk.word_tokenize(address)
textTokens = nltk.Text(tokens)
POStagList = nltk.pos_tag(textTokens)
joined = ' '.join(["{}<{}>".format(word,tag) for word,tag in POStagList])
grammar = r'NN[^>]?|JJ|FW' # regex! 
print([re.sub("<(?:{})>".format(grammar), "", x.strip()) for x in re.findall(r'((?:\S+<(?:{0})> )+)(?:GmbH|Inc|corp|llc)<(?:{0})>'.format(grammar), joined)])

Output: ['CompanyName'].

Here, the grammar is specified using a regex, NN[^>]?|JJ|FW, where [^>]? optionally matches one char other than > (this makes sure the tag part never matches >, as . could). The ((?:\S+<(?:NN[^>]?|JJ|FW)> )+)(?:GmbH|Inc|corp|llc)<(?:NN[^>]?|JJ|FW)> regex then finds all the matches you need. Since the captured text still contains the tags, they are removed with re.sub using the simpler <(?:NN[^>]?|JJ|FW)> regex.
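To make the intermediate steps visible, here is a minimal sketch that skips the tagger and starts from a hand-written tag list. The tags in `tagged` are assumed for illustration only; the real output of nltk.pos_tag may differ.

```python
import re

# Hand-written (word, tag) pairs standing in for nltk.pos_tag output.
# These tags are an assumption, not actual tagger output.
tagged = [("CompanyName", "NNP"), ("GmbH", "NNP"), ("*", "NN"), ("Keltenstr.", "NNP")]

# Step 1: glue every word to its tag, e.g. "CompanyName<NNP> GmbH<NNP> ..."
joined = " ".join("{}<{}>".format(word, tag) for word, tag in tagged)
print(joined)

# Step 2: capture the run of allowed-tag tokens that precedes a known suffix
tags = r"NN[^>]?|JJ|FW"
pattern = r"((?:\S+<(?:{0})> )+)(?:GmbH|Inc|corp|llc)<(?:{0})>".format(tags)
raw = re.findall(pattern, joined)
print(raw)  # captured text still carries the tags

# Step 3: strip the tags from each captured run
cleaned = [re.sub("<(?:{})>".format(tags), "", m.strip()) for m in raw]
print(cleaned)
```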

The main regex details:

  • ((?:\S+<(?:NN[^>]?|JJ|FW)> )+) - Group 1: one or more sequences of 1+ non-whitespace chars followed by <, then NN + 1 or 0 chars other than >, or JJ, or FW, then > and a space
  • (?:GmbH|Inc|corp|llc) - any of the alternatives: GmbH, Inc, corp or llc
  • <(?:NN[^>]?|JJ|FW)> - <, then NN + 1 or 0 chars other than >, or JJ, or FW, and then >.
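The code above returns only the words before the suffix, while the question asks for CompanyName GmbH. A possible variation, sketched under assumptions, moves the suffix inside the capturing group so it is kept in the result. The helper name `company_names`, the `SUFFIXES` list, and the sample tags are hypothetical and not part of NLTK; the input is expected to be a list of (word, tag) pairs such as nltk.pos_tag would produce.

```python
import re

TAGS = r"NN[^>]?|JJ|FW"                            # allowed POS tags, as a regex
SUFFIXES = ["GmbH", "Inc.", "Inc", "corp", "llc"]  # hypothetical suffix list

def company_names(tagged, suffixes=SUFFIXES, tags=TAGS):
    """Return 'name suffix' strings found in a list of (word, tag) pairs."""
    joined = " ".join("{}<{}>".format(word, tag) for word, tag in tagged)
    # re.escape keeps characters such as the dot in "Inc." literal
    suffix_alt = "|".join(re.escape(s) for s in suffixes)
    # The suffix word sits inside Group 1, so it survives in the result
    pattern = r"((?:\S+<(?:{0})> )+(?:{1}))<(?:{0})>".format(tags, suffix_alt)
    # Remove the remaining <TAG> markers from each captured run
    return [re.sub("<(?:{})>".format(tags), "", m)
            for m in re.findall(pattern, joined)]

# Assumed tags for illustration; real tagger output may differ.
sample = [("CompanyName", "NNP"), ("GmbH", "NNP"), ("*", "NN"),
          ("Keltenstr.", "NNP"), ("16", "CD")]
print(company_names(sample))
```

Because the suffixes are passed through re.escape, new literal suffixes can be appended to the list without worrying about regex metacharacters.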

Upvotes: 1
