Reputation: 71
I'm playing with the Brown corpus, specifically the tagged sentences in "news". I've found that "to" is the word with the most ambiguous tags (TO, IN, TO-HL, IN-HL, IN-TL, NPS). I'm trying to write code that will print one sentence from the corpus for each tag associated with "to". The sentences do not need to be "cleaned" of the tags; each one just needs to contain "to" with one of the associated POS tags.
import nltk
brown_sents = nltk.corpus.brown.tagged_sents(categories="news")
for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == "IN"):
            print sent
I tried the above code with just one of the POS tags to see if it worked, but it prints every matching sentence. I need it to print just the first sentence that matches the word and tag, and then stop. I tried this:
for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'IN'):
            print sent
        if (word != 'to' and tag != 'IN'):
            break
This works with this pos-tag because it's the first one related to "to", but if I use:
for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'TO-HL'):
            print sent
        if (word != 'to' and tag != 'TO-HL'):
            break
It returns nothing. I think I am SO close -- care to help?
Upvotes: 2
Views: 2050
Reputation: 122300
With regards to why this isn't working:
for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'TO-HL'):
            print sent
        if (word != 'to' and tag != 'TO-HL'):
            break
Before the explanation: your code is not really close to the output you want, because your if statements are not doing what you need.
First you need to understand what the multiple conditions (i.e. the two 'if' statements) are doing.
# Loop through the sentences
for sent in brown_sents:
    # Loop through each word with its POS
    for word, tag in sent:
        # For each (word, tag) pair, check whether it is the pair we want:
        if word == 'to' and tag == 'TO-HL':
            print sent # If the condition is true, print sent
        # After checking the first if, you continue to check the second if:
        # if word is not 'to' and tag is not 'TO-HL',
        # you break out of the sentence. Note that you are still
        # in the same iteration as the previous condition.
        if word != 'to' and tag != 'TO-HL':
            break
Now let's start with a basic if-else statement:
>>> from nltk.corpus import brown
>>> first_sent = brown.tagged_sents()[0]
>>> first_sent
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')]
>>> for word, pos in first_sent:
...     if word != 'to' and pos != 'TO-HL':
...         break
...     else:
...         print 'say hi'
...
>>>
In the example above we loop through each word+POS pair in the sentence, and at EVERY pair the if condition checks whether the word is not 'to' and the POS is not 'TO-HL'. If that is the case, it breaks and never says hi to you.
So if you keep your code with these if conditions, you will almost always break out of a sentence at its first word, because 'to' is rarely the first word of a sentence and the first word's POS rarely matches. The break only leaves the inner loop, so the outer loop simply moves on to the next sentence and does the same thing again.
In fact, your if condition ends up requiring that EVERY word before the match is either a 'to' or carries the POS tag 'TO-HL'.
What you want to check is:
(1) whether 'to' is in the sentence, and
(2) whether that 'to' has the POS tag you are looking for.
So the if condition you need for condition (1) is:
>>> from nltk.corpus import brown
>>> three_sents = brown.tagged_sents()[:3]
>>> for sent in three_sents:
...     if 'to' in dict(sent):
...         print sent
...
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]
Now you know that if 'to' in dict(sent) checks whether 'to' is in the sentence.
Then to check for condition (2):
>>> for sent in three_sents:
...     if 'to' in dict(sent):
...         if dict(sent)['to'] == 'TO':
...             print sent
...
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]
>>> for sent in three_sents:
...     if 'to' in dict(sent):
...         if dict(sent)['to'] == 'TO-HL':
...             print sent
...
>>>
Now you see that the check if dict(sent)['to'] == 'TO-HL', applied AFTER you have checked if 'to' in dict(sent), is what enforces the POS restriction.
But note that if a sentence contains two 'to' tokens, dict(sent)['to'] only captures the POS of the final 'to'. That is why you need the defaultdict(list) as suggested in the other answer.
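To make the dict(sent) pitfall concrete, here is a minimal sketch. The sentence toy_sent is made up for illustration (it is not from the Brown corpus), and the snippet is written for Python 3, unlike the Python 2 sessions above.

```python
from collections import defaultdict

# Hypothetical toy sentence in the same (word, tag) format as
# brown.tagged_sents(); it is NOT taken from the corpus.
toy_sent = [('He', 'PPS'), ('went', 'VBD'), ('to', 'IN'), ('town', 'NN'),
            ('to', 'TO'), ('shop', 'VB'), ('.', '.')]

# dict() keeps one value per key, so the later ('to', 'TO')
# overwrites the earlier ('to', 'IN').
last_tag_only = dict(toy_sent)
print(last_tag_only['to'])   # TO

# Collect every tag seen for each word instead of only the last one.
tags_by_word = defaultdict(list)
for word, tag in toy_sent:
    tags_by_word[word].append(tag)
print(tags_by_word['to'])    # ['IN', 'TO']
```

With the list-valued mapping, a check such as 'IN' in tags_by_word['to'] no longer depends on which 'to' happens to come last in the sentence.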
There is really no clean way to perform these checks with a plain dict, and the most efficient way is the one described in the other answer, sigh.
Upvotes: 1
Reputation: 122300
You can continue to add to your current code but your code didn't consider these things:
If you want to stick with your code try this:
from nltk.corpus import brown
brown_sents = brown.tagged_sents(categories="news")
def to_pos_sent(pos):
    # Yield every sentence in which 'to' carries the given POS tag
    # (a sentence is yielded once per matching token).
    for sent in brown_sents:
        for word, tag in sent:
            if word == 'to' and tag == pos:
                yield sent
for sent in to_pos_sent('TO'):
    print sent
for sent in to_pos_sent('IN'):
    print sent
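The loops above still print every match, while the question asks for only the first sentence per tag. One possible way to stop at the first hit is to take a single item from a generator with next(). The sketch below is in Python 3 and runs on a small made-up list, toy_sents, standing in for brown.tagged_sents(categories="news"); the helper name first_sent_with is hypothetical.

```python
# Hypothetical stand-in for brown.tagged_sents(categories="news"),
# so the sketch runs without the corpus installed.
toy_sents = [
    [('He', 'PPS'), ('wants', 'VBZ'), ('to', 'TO'), ('go', 'VB')],
    [('She', 'PPS'), ('went', 'VBD'), ('to', 'IN'), ('town', 'NN')],
    [('They', 'PPSS'), ('like', 'VB'), ('to', 'TO'), ('read', 'VB')],
]

def first_sent_with(tagged_sents, target_word, target_tag):
    """Return the first sentence containing (target_word, target_tag), or None."""
    matches = (sent for sent in tagged_sents
               if (target_word, target_tag) in sent)
    # next() pulls one item and stops; the default None covers "no match".
    return next(matches, None)

for tag in ['TO', 'IN', 'TO-HL']:
    print(tag, first_sent_with(toy_sents, 'to', tag))
```

On the toy data this prints the first sentence for TO, the second for IN, and None for TO-HL, since no toy sentence has that tag. Because the generator is lazy, the scan stops as soon as one match is found.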
I suggest that you store the sentences in a defaultdict(list); then you can retrieve them anytime.
from nltk.corpus import brown
from collections import Counter, defaultdict
sents_with_to = defaultdict(list)
to_counts = Counter()
for i, sent in enumerate(brown.tagged_sents(categories='news')):
    # Check if 'to' is in sentence.
    uniq_words = dict(sent)
    if 'to' in uniq_words or 'To' in uniq_words:
        # Iterate through the sentence to find 'to'
        for word, pos in sent:
            if word.lower() == 'to':
                # Store the sentence under this POS tag and count the occurrence
                sents_with_to[pos].append(sent)
                to_counts[pos] += 1
for pos in sents_with_to:
    for sent in sents_with_to[pos]:
        print pos, sent
To access the sentences of a specific POS:
for sent in sents_with_to['TO']:
    print sent
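If only one example sentence per tag is needed, as in the question, a variation is to keep just the first sentence seen for each tag in a single pass. This is a sketch in Python 3 on made-up data: toy_sents and first_example are hypothetical names, and the toy sentences are not from the corpus.

```python
# Hypothetical toy data in the (word, tag) format of brown.tagged_sents().
toy_sents = [
    [('To', 'TO-HL'), ('meet', 'VB-HL'), ('today', 'NR-HL')],
    [('He', 'PPS'), ('wants', 'VBZ'), ('to', 'TO'), ('go', 'VB')],
    [('She', 'PPS'), ('went', 'VBD'), ('to', 'IN'), ('town', 'NN')],
    [('They', 'PPSS'), ('like', 'VB'), ('to', 'TO'), ('read', 'VB')],
]

first_example = {}
for sent in toy_sents:
    for word, tag in sent:
        if word.lower() == 'to':
            # setdefault stores the sentence only the first time a tag is
            # seen; later sentences with the same tag are ignored.
            first_example.setdefault(tag, sent)

for tag in sorted(first_example):
    print(tag, first_example[tag])
```

With the defaultdict(list) from above, the equivalent lookup would be sents_with_to[pos][0] for each pos, at the cost of storing every matching sentence.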
You'll notice that if 'to' with a specific POS appears twice in a sentence, the sentence is recorded twice in sents_with_to[pos]. If you want to remove the duplicates, try:
sents_with_to_and_TO = set(" ".join("#".join((word, pos)) for word, pos in sent) for sent in sents_with_to['TO'])
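The reason for flattening each sentence into a string is that a tagged sentence is a list, and lists are not hashable, so they cannot go into a set directly. A small Python 3 sketch of the idea, using a made-up sentence and a hypothetical helper named flatten:

```python
# Hypothetical sentence in which ('to', 'TO') occurs twice, so it would be
# appended twice by the collection loop.
sent = [('He', 'PPS'), ('wants', 'VBZ'), ('to', 'TO'), ('try', 'VB'),
        ('to', 'TO'), ('sleep', 'VB')]
collected = [sent, sent]

def flatten(tagged_sent):
    """Join each (word, tag) pair as word#tag and separate pairs with spaces."""
    return " ".join("#".join((word, tag)) for word, tag in tagged_sent)

# Strings are hashable, so the set keeps one copy of each distinct sentence.
unique = set(flatten(s) for s in collected)
print(unique)   # {'He#PPS wants#VBZ to#TO try#VB to#TO sleep#VB'}
```

Note that str.join takes a single iterable, which is why the pair is passed as the tuple (word, tag). If the tagged structure should be preserved rather than turned into a string, converting each sentence with tuple(sent) before adding it to the set is another option.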
Upvotes: 2