Lili

Reputation: 371

KeyError when cleaning tweets column using stop words in python

I have a data frame of tweets and I'm trying to clean the 'tweet' column by removing stop words and lemmatizing the remaining words.

Below is my code:

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

sentence = df['tweet'].apply(nltk.sent_tokenize)

 0 [ 'country year happy']
 1 [ 'wish happy year']
 2 [ 'live year together']

for i in range(len(sentence)): 
    words=nltk.word_tokenize(str(sentence[i]))
    words=[lemmatizer.lemmatize(word) for word in words if word not in 
          set(stopwords.words('english'))]
    sentence[i]=' '.join(words)

The code above raises the following error (full traceback included):

 KeyError  Traceback (most recent call last)
<ipython-input-384-f4bb836363e1> in <module>
  1 for i in range(len(sentence)):
----> 2     words=nltk.word_tokenize(str(sentence[i]))
  3     words=[lemmatizer.lemmatize(word) for word in words if word not in 
      set(stopwords.words('english'))]
  4     sentence[i]=' '.join(words)

~\anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
    869         key = com.apply_if_callable(key, self)
    870         try:
--> 871             result = self.index.get_value(self, key)
    872 
    873             if not is_scalar(result):

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_value(self, series, key)
   4403         k = self._convert_scalar_indexer(k, kind="getitem")
   4404         try:
-> 4405             return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
   4406         except KeyError as e1:
   4407             if len(self) > 0 and (self.holds_integer() or self.is_boolean()):

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 34

How can I fix the error?

Also, how can I add the results to my data frame as a new column?

Upvotes: 0

Views: 1536

Answers (1)

Eric Doi

Reputation: 56

Use sentence.iloc[i] instead of sentence[i].

Explanation

The KeyError means that there is no label 34 in the index of sentence (which it inherits from df.index).

sentence is a pandas Series. When you access sentence[i] with an integer i and the index holds integers, pandas looks i up as a label (like sentence.loc[i]). It only falls back to position-based lookup (like sentence.iloc[i]) when the index is non-integer, and recent pandas versions deprecate that fallback. So the loop works only while the index happens to be exactly 0, 1, 2, ...; as soon as a label such as 34 is missing (for example, if rows were dropped earlier), it raises a KeyError. You can fix this by explicitly using position-based indexing (sentence.iloc[i]).

For a self-contained example:

Doesn't work (raises KeyError: 0, because the index labels are 10 and 20)

import pandas as pd
df = pd.DataFrame({'index': [10,20], 'tweets': [['hello world'],['foo bar']]}).set_index('index')
sentence = df['tweets']

for i in range(len(sentence)):
    print(sentence[i])

Works

import pandas as pd
df = pd.DataFrame({'index': [10,20], 'tweets': [['hello world'],['foo bar']]}).set_index('index')
sentence = df['tweets']

for i in range(len(sentence)):
    print(sentence.iloc[i])
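
The question doesn't show how df was built, so the following is only a guess at how a label like 34 can go missing: filtering or dropping rows keeps the original labels and leaves gaps in the index. A minimal sketch with made-up data, showing the gap, the position-based lookup, and reset_index(drop=True) as an alternative fix:

```python
import pandas as pd

# Made-up data; the 'spam' row stands in for any row removed during cleaning
df = pd.DataFrame({'tweet': ['country year happy', 'spam', 'wish happy year']})

# Filtering keeps the original labels, so the index is now [0, 2]
filtered = df[df['tweet'] != 'spam']
sentence = filtered['tweet']

labels = list(sentence.index)      # labels that survived the filter
by_position = sentence.iloc[1]     # second remaining row, whatever its label

try:
    sentence[1]                    # label 1 was filtered out
    missing_label_raises = False
except KeyError:
    missing_label_raises = True

# reset_index(drop=True) renumbers the rows 0..n-1, so labels match positions again
renumbered = sentence.reset_index(drop=True)
by_label_after_reset = renumbered[1]
```

If the index gaps carry no meaning, resetting the index once after filtering lets the original loop run unchanged; otherwise .iloc is the less invasive change.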

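A tip: rather than iterating manually over the rows of a DataFrame, it's usually safer and faster to write your logic as a function and use apply (for a single column, df['tweet'].apply(your_function)). Assigning the result to a new column also answers the second part of the question.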
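
As a rough sketch of that tip: the example below cleans each tweet with a function and stores the result in a new column. To keep it runnable without downloading NLTK data, STOP_WORDS is a small hypothetical stand-in for set(stopwords.words('english')), text.split() stands in for nltk.word_tokenize, and the normalize parameter is a hypothetical hook where lemmatizer.lemmatize from the question could be passed. The column name clean_tweet is also made up.

```python
import pandas as pd

# Hypothetical stand-in for set(stopwords.words('english'))
STOP_WORDS = {'the', 'a', 'is', 'we', 'this'}

def clean_tweet(text, normalize=str.lower):
    """Drop stop words and normalize the remaining tokens."""
    tokens = text.split()  # stand-in for nltk.word_tokenize(text)
    kept = [normalize(tok) for tok in tokens if tok.lower() not in STOP_WORDS]
    return ' '.join(kept)

# Non-default index on purpose: apply never looks rows up by label or position
df = pd.DataFrame(
    {'tweet': ['This is a happy year', 'We wish the country well']},
    index=[10, 34],
)

# apply calls clean_tweet once per row and aligns the results on the index
df['clean_tweet'] = df['tweet'].apply(clean_tweet)
```

Because apply hands each value to the function directly, there is no sentence[i] lookup left to fail, and the new column lines up with the original rows even when the index has gaps.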

Upvotes: 1
