user9845038

Reputation: 41

Apply String directly in NLTK pattern

I am new to NLTK and am trying to extract company names from a string. This is the code I wrote, but it does not produce any output. Is it possible to use a string value directly in a pattern? Can anyone please help? Thanks in advance.

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag,RegexpParser
text="CompanyName1 GmbH is from Germany. CompanyName2 Inc is from America. ComapnyName3 corp is from India."
pattern = r"""
P: {<NNP>+<GmbH|Inc|corp>}
"""
for sent in sent_tokenize(text):
   sentence = sent.split()
   print("Parts of speech :",pos_tag(sentence))
   PChunker = RegexpParser(pattern)
   output= PChunker.parse(pos_tag(sentence))
   for subtree in output.subtrees(filter=lambda t: t.label() == 'P'):
     # print(subtree)
     print(' '.join([x[0] for x in subtree]))

Upvotes: 1

Views: 57

Answers (1)

Wiktor Stribiżew

Reputation: 627082

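RegexpParser chunk patterns are matched against the POS tags, not against the words themselves, so <GmbH|Inc|corp> never matches a token. You may combine regex and NLTK features here instead.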

import re
...
text="CompanyName1 GmbH is from Germany. CompanyName2 Inc is from America. Comapny Name3 corp is from India."
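for sent in sent_tokenize(text):
    tagged = pos_tag(word_tokenize(sent))
    # Append each tag to its word: "word1<TAG1> word2<TAG2> ..."
    joined = ' '.join(["{}<{}>".format(word,tag) for word,tag in tagged])
    # Find NNP runs followed by a company suffix, then drop the <NNP> markers
    print([x.strip().replace("<NNP>", "") for x in re.findall(r'((?:\S+<NNP> )+)(?:GmbH|Inc|corp)<NN[^>]*>', joined)])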
    print('-------- NEXT SENTENCE ----------')

This outputs:

['CompanyName1']
-------- NEXT SENTENCE ----------
['CompanyName2']
-------- NEXT SENTENCE ----------
['Comapny Name3']
-------- NEXT SENTENCE ----------

The joined = ' '.join(["{}<{}>".format(word,tag) for word,tag in tagged]) line builds a temporary sentence in which each word is followed by its tag. The regex ((?:\S+<NNP> )+)(?:GmbH|Inc|corp)<NN[^>]*> matches:

  • ((?:\S+<NNP> )+) - Capturing group 1 (this is what re.findall returns): 1 or more non-whitespace chars followed by <NNP> and a space, with the whole sequence repeated 1 or more times (due to +)
  • (?:GmbH|Inc|corp) - a non-capturing group that matches any of the 3 alternatives (| is the alternation operator)
  • <NN[^>]*> - <NN, then 0 or more chars other than >, and then a >.
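
To see the pattern at work without running the tagger, here is a minimal sketch that applies the same regex to hand-written (word, tag) pairs. The tags below are assumed for illustration only; they are not actual pos_tag output, which may differ.

```python
import re

# Hand-written (word, tag) pairs standing in for pos_tag output.
# These tags are an assumption for illustration, not produced by NLTK.
tagged = [("Comapny", "NNP"), ("Name3", "NNP"), ("corp", "NN"),
          ("is", "VBZ"), ("from", "IN"), ("India", "NNP"), (".", ".")]

# Append each tag to its word, separated by single spaces.
joined = " ".join("{}<{}>".format(word, tag) for word, tag in tagged)

# Group 1 captures the run of NNP-tagged words that precedes the suffix.
pattern = r"((?:\S+<NNP> )+)(?:GmbH|Inc|corp)<NN[^>]*>"
matches = re.findall(pattern, joined)

print(joined)
print(matches)
```

Note that the captured text still carries the <NNP> markers and a trailing space at this point, and that "India" is not returned because no company suffix follows it.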

To get the final result, the tags must be removed from the company names: x.strip().replace("<NNP>", "") strips the whitespace from both ends of each match and removes the <NNP> tags with a plain str.replace call.
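
The cleanup step can be sketched as a small helper. The function name clean_company_name and the tagged string below are hypothetical, chosen only for illustration; the string assumes the tagger labels the suffix itself as NNP, which the <NN[^>]*> part of the regex still accepts.

```python
import re

def clean_company_name(match_text):
    """Strip surrounding whitespace and drop the <NNP> markers (hypothetical helper)."""
    return match_text.strip().replace("<NNP>", "")

# Assumed tagging in which the suffix "GmbH" is itself tagged NNP.
joined = "CompanyName1<NNP> GmbH<NNP> is<VBZ> from<IN> Germany<NNP> .<.>"

# The greedy group first swallows "GmbH<NNP> " too, then backtracks
# so that the suffix alternative can match it.
raw = re.findall(r"((?:\S+<NNP> )+)(?:GmbH|Inc|corp)<NN[^>]*>", joined)
names = [clean_company_name(m) for m in raw]

print(names)
```

The same helper also handles multi-word names: each <NNP> marker is removed while the single spaces between the words are kept.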

Upvotes: 1
