Reputation: 41
I am trying to extract a company name from a given German address using Python NLTK. This is the code I used:
import nltk
address="CompanyName GmbH * Keltenstr. 16 * 123456 Kippenheim * Deutschland"
tokens = nltk.word_tokenize(address)
textTokens = nltk.Text(tokens)
POStagList = nltk.pos_tag(textTokens)
print(POStagList)
grammar = """
NP:
{<NN.?|JJ|FW>GmbH}"""
cp = nltk.RegexpParser(grammar)
result = cp.parse(POStagList)
for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):
    print("NP Subtree:", subtree)
I need the output: CompanyName GmbH
Sometimes, instead of GmbH, it may be corp, Inc., llc, etc.
How can I solve this?
How can I use string values and escape sequence characters directly inside the grammar?
Upvotes: 1
Views: 245
Reputation: 627082
Instead of mixing grammar with literal strings, you may use a workaround based on regex: tag the tokens with POS, and then grab only the tokens that come before known words (like GmbH).
The code will look like this:
import nltk
import re

address = "CompanyName GmbH * Keltenstr. 16 * 123456 Kippenheim * Deutschland"
tokens = nltk.word_tokenize(address)
textTokens = nltk.Text(tokens)
POStagList = nltk.pos_tag(textTokens)
# Serialize as word<TAG> pairs so a plain regex can see both words and tags
joined = ' '.join(["{}<{}>".format(word, tag) for word, tag in POStagList])
grammar = r'NN[^>]?|JJ|FW' # regex!
# One or more tagged tokens that come right before a known legal-form word
pattern = r'((?:\S+<(?:{0})> )+)(?:GmbH|Inc|corp|llc)<(?:{0})>'.format(grammar)
# Strip the <TAG> markers from each captured group
print([re.sub("<(?:{})>".format(grammar), "", x.strip()) for x in re.findall(pattern, joined)])
Output: ['CompanyName']
Here, the grammar is specified using a regex like NN[^>]?|JJ|FW where [^>]? optionally matches any single char other than > (just to make sure we do not match >, as . would do). After that, the ((?:\S+<(?:NN[^>]?|JJ|FW)> )+)(?:GmbH|Inc|corp|llc)<(?:NN[^>]?|JJ|FW)> regex will find all the matches you need, but since they contain tags, the tags must be removed with re.sub using a mere <(?:NN[^>]?|JJ|FW)> regex.
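If you also want the legal-form word in the result (the question asks for CompanyName GmbH), one option is to capture it in a second group. Below is a minimal sketch of that idea, not a definitive implementation. It uses a hand-written list of (word, tag) pairs in place of the nltk.pos_tag output so that it runs without any NLTK data; the tags shown are assumed for illustration and a real tagger may assign different ones. The extract_companies helper and its keep_suffix parameter are hypothetical names introduced here.

```python
import re

# Hand-written (word, tag) pairs standing in for nltk.pos_tag output.
# These tags are assumptions for illustration; a real tagger may differ.
tagged = [
    ("CompanyName", "NNP"), ("GmbH", "NNP"), ("*", "NN"),
    ("Keltenstr.", "NNP"), ("16", "CD"),
]

TAGS = r"NN[^>]?|JJ|FW"          # same tag pattern as in the answer
SUFFIXES = r"GmbH|Inc|corp|llc"  # known legal-form words


def extract_companies(pairs, keep_suffix=True):
    """Return the names found right before a known legal-form word."""
    # Serialize as "word<TAG> word<TAG> ..." so one regex sees words and tags
    joined = " ".join("{}<{}>".format(word, tag) for word, tag in pairs)
    # Group 1: the tagged name tokens, group 2: the legal-form word itself
    pattern = r"((?:\S+<(?:{0})> )+)({1})<(?:{0})>".format(TAGS, SUFFIXES)
    results = []
    for match in re.finditer(pattern, joined):
        # Drop the <TAG> markers from the name part
        name = re.sub(r"<(?:{})>".format(TAGS), "", match.group(1)).strip()
        if keep_suffix:
            name = "{} {}".format(name, match.group(2))
        results.append(name)
    return results


print(extract_companies(tagged))                     # ['CompanyName GmbH']
print(extract_companies(tagged, keep_suffix=False))  # ['CompanyName']
```

Note that NN[^>]? covers NN, NNS and NNP, but not the four-letter NNPS tag, since [^>]? matches at most one extra character. If you need plural proper nouns as well, something like NN[^>]* may be a better fit.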
The main regex details:
((?:\S+<(?:NN[^>]?|JJ|FW)> )+) - Group 1: one or more sequences of 1+ non-whitespace chars followed with <, then NN + any 1 or 0 chars other than >, or JJ or FW, and then > and then a space
(?:GmbH|Inc|corp|llc) - any of the alternatives: GmbH, Inc, corp or llc
<(?:NN[^>]?|JJ|FW)> - <, then NN + any 1 or 0 chars other than >, or JJ or FW, and then >
Upvotes: 1