secumind

Reputation: 1141

How do I switch my input default dictionary to lower case for NLTK comparison in Python

I have a python dict that looks like:

defaultdict(<type 'int'>, {u'RT': 1, u'be': 1, u'uniforms': 1, u'@ProFootballWkly:': 1, u'in': 1, u'Nike': 1, u'Brooklyn.': 1, u'ET': 1, u"NFL's": 1, u'will': 1, u'a.m.': 1, u'at': 1, u'unveiled': 1, u'Jimmy': 3, u'11': 1, u'new': 1, u'The': 2, u'today': 1})

I'm processing it with:

freq_distribution = nltk.FreqDist(filtered_words)               
top_words = freq_distribution.keys()[:4]     
print top_words                 

This outputs the top 4 words, which include the word "The". I am trying to remove Dolch "commonly used" words before this step with:

filtered_words = [w for w in word_count \
              if not w in stopwords.words('english')]

The problem is that I still end up with the word "The", because all of the NLTK stopwords are lowercase. I need a way to take the words in word_count and convert them to lower case. I have tried adding lower() in various places, such as:

freq_distribution = nltk.FreqDist(word_count.lower())

But have not had any success, as I repeatedly get the following error:

AttributeError: 'list' object has no attribute 'lower'

Upvotes: 1

Views: 534

Answers (1)

dhg

Reputation: 52681

filtered_words = [w for w in word_count \
          if w.lower() not in stopwords.words('english')]

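This lowercases w before checking whether it's in the stopwords list. So if w is "The", it will be transformed to "the" before checking. Since "the" is in the list, it will get filtered out. Note that lower() is a string method, which is why calling it on the whole collection raises an AttributeError; it has to be applied to each word individually.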
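If you also want the counts themselves merged case-insensitively (so "The" and "the" end up as one entry), something along these lines might work. This is only a sketch: STOPWORDS is a small made-up stand-in for set(stopwords.words('english')) so that it runs without the NLTK corpus, collections.Counter stands in for nltk.FreqDist, and filter_stopwords is a hypothetical helper name, not part of NLTK.

```python
from collections import Counter, defaultdict

# Assumption: tiny stand-in for set(stopwords.words('english')).
# Building the set once also avoids re-reading the list for every word.
STOPWORDS = {"the", "be", "in", "at", "will", "a"}

# A shortened version of the dict from the question.
word_count = defaultdict(int, {"The": 2, "Jimmy": 3, "Nike": 1,
                               "in": 1, "be": 1, "new": 1})


def filter_stopwords(counts, stopwords):
    """Hypothetical helper: lowercase each word, drop stopwords, merge counts."""
    merged = Counter()
    for word, n in counts.items():
        key = word.lower()  # lower() is applied per string, not to the dict
        if key not in stopwords:
            merged[key] += n
    return merged


filtered = filter_stopwords(word_count, STOPWORDS)
# most_common() sorts by count, highest first.
top_words = [w for w, _ in filtered.most_common(1)]
print(top_words)
```

With the real NLTK list you would replace STOPWORDS with set(stopwords.words('english')). Keep in mind that this version lowercases the keys, so a name like "Jimmy" comes out as "jimmy"; if you need the original casing, stick with the list comprehension above.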

Upvotes: 4
