Apply String directly in NLTK pattern

Question

I am new to NLTK and am trying to extract company names from a string. This is the code I wrote, but it does not produce the expected output. Is it possible to use a string value directly in patterns? Can anyone please help me? Thanks in advance.

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag,RegexpParser
text="CompanyName1 GmbH is from Germany. CompanyName2 Inc is from America. ComapnyName3 corp is from India."
pattern = r"""
P: {+}
"""
for sent in sent_tokenize(text):
   sentence = sent.split()
   print("Parts of speech :",pos_tag(sentence))
   PChunker = RegexpParser(pattern)
   output= PChunker.parse(pos_tag(sentence))
   for subtree in output.subtrees(filter=lambda t: t.label() == 'P'):
     # print(subtree)
     print(' '.join([x[0] for x in subtree]))

Wiktor Stribiżew · Accepted Answer

You may combine the regex and NLTK features here.

import re
from nltk import pos_tag
from nltk.tokenize import sent_tokenize, word_tokenize
text="CompanyName1 GmbH is from Germany. CompanyName2 Inc is from America. Comapny Name3 corp is from India."
for sent in sent_tokenize(text):
    tagged = pos_tag(word_tokenize(sent))
    joined = ' '.join(["{}<{}>".format(word,tag) for word,tag in tagged])
    print([x.strip().replace("<NNP>", "") for x in re.findall(r'((?:\S+<NNP> )+)(?:GmbH|Inc|corp)<[^>]*>', joined)])
    print('-------- NEXT SENTENCE ----------')

This outputs:

['CompanyName1']
-------- NEXT SENTENCE ----------
['CompanyName2']
-------- NEXT SENTENCE ----------
['Comapny Name3']
-------- NEXT SENTENCE ----------

The joined = ' '.join(["{}<{}>".format(word,tag) for word,tag in tagged]) part creates a temporary sentence with the tag appended to each word. The regex is ((?:\S+<NNP> )+)(?:GmbH|Inc|corp)<[^>]*>. It matches:

((?:\S+<NNP> )+) - Capturing group 1 (it will be the output of re.findall): 1 or more non-whitespace chars followed by <NNP> and a space, all repeated 1 or more times (due to +)
(?:GmbH|Inc|corp) - a non-capturing group that matches any of the 3 alternatives (| is the alternation operator)
<[^>]*> - a <, then any 0 or more chars other than >, and then a >.
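
To see the three parts of the regex at work without installing NLTK, here is a minimal sketch that runs the pattern on a hand-written tagged string. The tags in `joined` are assumed for illustration and are not the result of actually calling pos_tag.

```python
import re

# Hand-written stand-in for the `joined` string built in the answer.
# The tags are assumed here, not produced by running pos_tag.
joined = "CompanyName1<NNP> GmbH<NNP> is<VBZ> from<IN> Germany<NNP> .<.>"

pattern = re.compile(r'((?:\S+<NNP> )+)(?:GmbH|Inc|corp)<[^>]*>')

match = pattern.search(joined)
group1 = match.group(1)   # the run of NNP-tagged words before the suffix
whole = match.group(0)    # that run plus the suffix word and its tag

print(repr(group1))  # 'CompanyName1<NNP> '
print(repr(whole))   # 'CompanyName1<NNP> GmbH<NNP>'
```

Group 1 still carries the tag and a trailing space, which is why a clean-up step is needed afterwards.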



To get the final result, the tags should be removed from the company names, so you may just use x.strip().replace("<NNP>", "") - strip the whitespace from the start/end of the found match and remove the <NNP> tag with a plain str.replace call.
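
The join, match and clean-up steps can also be wrapped in one function. This is only a sketch: `extract_company_names` and `SUFFIX_RE` are hypothetical names, and the tagged pairs are written by hand (assumed tags) so the snippet runs without NLTK.

```python
import re

# Hypothetical helper names; the regex itself is the one from the answer.
SUFFIX_RE = re.compile(r'((?:\S+<NNP> )+)(?:GmbH|Inc|corp)<[^>]*>')

def extract_company_names(tagged):
    """Return company names found in a list of (word, tag) pairs."""
    # Build the temporary "word<TAG>" sentence.
    joined = ' '.join("{}<{}>".format(word, tag) for word, tag in tagged)
    # findall returns group 1 of each match; drop the trailing space and tags.
    return [m.strip().replace("<NNP>", "") for m in SUFFIX_RE.findall(joined)]

# Hand-tagged input standing in for pos_tag(word_tokenize(sent)).
tagged = [("Comapny", "NNP"), ("Name3", "NNP"), ("corp", "NN"),
          ("is", "VBZ"), ("from", "IN"), ("India", "NNP"), (".", ".")]

names = extract_company_names(tagged)
print(names)  # ['Comapny Name3']
```

With real input you would pass pos_tag(word_tokenize(sent)) instead of the hand-written list. Note that the approach only finds names whose words are all tagged NNP.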
