user1052462

Reputation: 156

Extracting proper nouns from tagged chunks using Python

I'm trying to extract all the proper nouns from a tagged paragraph. In my code, I first extract each paragraph separately and then check whether it contains any proper nouns. The problem is that no proper nouns are ever extracted; my code never even enters the loop that checks for the specific tag.

My code:

def noun(sen):
    m = []
    if (sen.split('/')[1].lower().startswith('np') & sen.split('/')[1].lower().endswith('np')):
        w = sen.strip().split('/')[0]
        m.append(w)
    return m


import nltk
rp = open("tesu.txt", 'r')
text = rp.read()
list = []
sentences = splitParagraph(text)
for s in sentences:
    list.append(s)

Sample input from 'tesu.txt'

Several/ap defendants/nns in/in the/at Summerdale/np police/nn burglary/nn trial/nn      made/vbd statements/nns indicating/vbg their/pp$ guilt/nn at/in the/at.... 

Bellows/np made/vbd the/at disclosure/nn when/wrb he/pps asked/vbd Judge/nn-tl Parsons/np to/to grant/vb his/pp$ client/nn ,/, Alan/np Clements/np ,/, 30/cd ,/, a/at separate/jj trial/nn ./.

How can I extract all the tagged proper nouns from a paragraph?

Upvotes: 0

Views: 451

Answers (2)

DNA

Reputation: 42617

Thanks for the data sample.

You need to:

  • read each paragraph/line
  • split the line by whitespace to extract each tagged word, e.g. Summerdale/np
  • split the word by / to see if it is tagged np
  • if so, add the other half of the split (the actual word) to your noun list

So, something like the following should work (based on Bogdan's answer, thanks!):

def noun(sentence):
    # collect every word whose tag is 'np' (proper noun)
    nouns = []
    for token in sentence.split():
        word, tag = token.split('/')
        if tag.lower() == 'np':
            nouns.append(word)
    return nouns

if __name__ == '__main__':
    nouns = []
    with open('tesu.txt', 'r') as file_p:
        for sentence in file_p.read().split('\n\n'):
            result = noun(sentence)
            if result:
                nouns.extend(result)
    print(nouns)

which, for your example data, produces:

['Summerdale', 'Bellows', 'Parsons', 'Alan', 'Clements']

Update: In fact, you can shorten the whole thing down to this:

nouns = []
with open('tesu.txt', 'r') as file_p:
    for token in file_p.read().split():
        word, tag = token.split('/')
        if tag.lower() == 'np':
            nouns.append(word)
print(nouns)

if you don't care which paragraph the nouns come from.
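If you do want to keep the nouns grouped by paragraph, a minimal variation on the same idea collects one list per '\n\n'-separated block (nouns_by_paragraph is just an illustrative name):

# sketch: one list of proper nouns per paragraph, same file and
# word/tag format as above
nouns_by_paragraph = []
with open('tesu.txt', 'r') as file_p:
    for paragraph in file_p.read().split('\n\n'):
        para_nouns = []
        for token in paragraph.split():
            word, tag = token.split('/')
            if tag.lower() == 'np':
                para_nouns.append(word)
        nouns_by_paragraph.append(para_nouns)
print(nouns_by_paragraph)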

You could also get rid of the .lower() if the tags are always lowercase as they are in your example.
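Since you already import nltk, note that it ships a helper for exactly this word/tag format: nltk.tag.str2tuple splits on the last '/', so tokens containing extra slashes are handled too. A sketch using it, assuming the same tesu.txt:

from nltk.tag import str2tuple

nouns = []
with open('tesu.txt', 'r') as file_p:
    for token in file_p.read().split():
        # str2tuple('Summerdale/np') -> ('Summerdale', 'NP'); it
        # normalizes the tag's case, so compare case-insensitively
        word, tag = str2tuple(token)
        if tag and tag.lower() == 'np':
            nouns.append(word)
print(nouns)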

Upvotes: 1

Bogdan

Reputation: 8246

You should work on your code style. There are a lot of unnecessary loops in there, I think. You also have an unnecessary method, splitParagraph, that basically just calls the already existing split method, and you import re but never use it afterwards. Also, indent your code; it's very hard to follow this way. You should provide a sample of the input from "tesu.txt" so we can help you more. Anyway, all of your code could be compacted into:

def noun(sentence):
    word, tag = sentence.split('/')
    if tag.lower().startswith('np') and tag.lower().endswith('np'):
        return word
    return False

if __name__ == '__main__':
    words = []
    with open('tesu.txt', 'r') as file_p:
        for sentence in file_p.read().split('\n\n'):
            result = noun(sentence)
            if result:
                words.append(result)
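Incidentally, since your original code imports re, a regex-based sketch could pull the np-tagged words out directly. This assumes each tag is followed by whitespace or the end of the text:

import re

with open('tesu.txt', 'r') as file_p:
    text = file_p.read()
# capture the token immediately before '/np'
nouns = re.findall(r'(\S+)/np(?=\s|$)', text)
print(nouns)

For the sample data this should produce the same list as above.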

Upvotes: 0
