Reputation: 41
I am new to NLTK and am trying to extract company names from a string. This is the code I wrote, but it does not produce any output. Is it possible to use a string value directly in chunk patterns? Can anyone please help me? Thanks in advance.
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag,RegexpParser
text="CompanyName1 GmbH is from Germany. CompanyName2 Inc is from America. ComapnyName3 corp is from India."
pattern = r"""
P: {<NNP>+<GmbH|Inc|corp>}
"""
for sent in sent_tokenize(text):
    sentence = sent.split()
    print("Parts of speech :",pos_tag(sentence))
    PChunker = RegexpParser(pattern)
    output= PChunker.parse(pos_tag(sentence))
    for subtree in output.subtrees(filter=lambda t: t.label() == 'P'):
        # print(subtree)
        print(' '.join([x[0] for x in subtree]))
Upvotes: 1
Views: 57
Reputation: 627082
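RegexpParser chunk patterns match POS tags, not the words themselves, so <GmbH|Inc|corp> can never match a token. You may combine regex and NLTK features here instead.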
import re
...
text="CompanyName1 GmbH is from Germany. CompanyName2 Inc is from America. Comapny Name3 corp is from India."
for sent in sent_tokenize(text):
    tagged = pos_tag(word_tokenize(sent))
    joined = ' '.join(["{}<{}>".format(word,tag) for word,tag in tagged])
    print([x.strip().replace("<NNP>", "") for x in re.findall(r'((?:\S+<NNP> )+)(?:GmbH|Inc|corp)<NN[^>]*>', joined)])
    print('-------- NEXT SENTENCE ----------')
This outputs:
['CompanyName1']
-------- NEXT SENTENCE ----------
['CompanyName2']
-------- NEXT SENTENCE ----------
['Comapny Name3']
-------- NEXT SENTENCE ----------
The joined = ' '.join(["{}<{}>".format(word,tag) for word,tag in tagged]) part creates a temporary sentence with tags appended to the words. The regex is ((?:\S+<NNP> )+)(?:GmbH|Inc|corp)<NN[^>]*> and it matches:
- ((?:\S+<NNP> )+) - Capturing group 1 (it will be the output of re.findall): 1 or more non-whitespace chars followed by <NNP> and a space, all repeated 1 or more times (due to +)
- (?:GmbH|Inc|corp) - a non-capturing group that matches any of the 3 alternatives (| is the alternation operator)
- <NN[^>]*> - a <NN, then any 0 or more chars other than >, and then a >.
To get the final result, the tags should be removed from the company names, so you may just use x.strip().replace("<NNP>", "") to strip the whitespace from the start/end of the found match and remove the <NNP> tags with a plain str.replace call.
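To see the regex step in isolation, here is a minimal sketch that skips the tagger and feeds in hand-written (word, tag) pairs. The tags below are assumptions made for illustration, not actual pos_tag output, and company_names is a hypothetical helper name.

```python
import re

# Same pattern as above: a run of NNP-tagged words followed by a known suffix
# whose tag starts with NN (NN, NNP, NNS, ...).
SUFFIX_RE = re.compile(r'((?:\S+<NNP> )+)(?:GmbH|Inc|corp)<NN[^>]*>')

def company_names(tagged):
    """Return company names found in a list of (word, tag) pairs."""
    # Build the temporary "word<TAG> word<TAG> ..." sentence.
    joined = ' '.join("{}<{}>".format(word, tag) for word, tag in tagged)
    # Group 1 holds the NNP run; drop the tags and surrounding whitespace.
    return [m.strip().replace("<NNP>", "") for m in SUFFIX_RE.findall(joined)]

# Hand-written tags (assumed, not produced by pos_tag).
tagged_1 = [("CompanyName1", "NNP"), ("GmbH", "NNP"), ("is", "VBZ"),
            ("from", "IN"), ("Germany", "NNP"), (".", ".")]
tagged_3 = [("Comapny", "NNP"), ("Name3", "NNP"), ("corp", "NN"),
            ("is", "VBZ"), ("from", "IN"), ("India", "NNP"), (".", ".")]
no_suffix = [("Berlin", "NNP"), ("is", "VBZ"), ("large", "JJ"), (".", ".")]

print(company_names(tagged_1))   # ['CompanyName1']
print(company_names(tagged_3))   # ['Comapny Name3']
print(company_names(no_suffix))  # []
```

Note that the result depends on the tagger labelling every word of the company name as NNP and the suffix as some NN* tag. If pos_tag assigns a different tag to either, the pattern will not match.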
Upvotes: 1