Reputation: 371
I have a data frame of tweets and I'm trying to clean my 'tweet' column: remove stop words and apply lemmatization.
Below is my code:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
sentence = df['tweet'].apply(nltk.sent_tokenize)
0 [ 'country year happy']
1 [ 'wish happy year']
2 [ 'live year together']
for i in range(len(sentence)):
    words = nltk.word_tokenize(str(sentence[i]))
    words = [lemmatizer.lemmatize(word) for word in words
             if word not in set(stopwords.words('english'))]
    sentence[i] = ' '.join(words)
The code above gives me the following error (full traceback included):
KeyError Traceback (most recent call last)
<ipython-input-384-f4bb836363e1> in <module>
1 for i in range(len(sentence)):
----> 2 words=nltk.word_tokenize(str(sentence[i]))
3 words=[lemmatizer.lemmatize(word) for word in words if word not in
set(stopwords.words('english'))]
4 sentence[i]=' '.join(words)
~\anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
869 key = com.apply_if_callable(key, self)
870 try:
--> 871 result = self.index.get_value(self, key)
872
873 if not is_scalar(result):
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_value(self,
series, key)
4403 k = self._convert_scalar_indexer(k, kind="getitem")
4404 try:
-> 4405 return self._engine.get_value(s, k,
tz=getattr(series.dtype, "tz", None))
4406 except KeyError as e1:
4407 if len(self) > 0 and (self.holds_integer() or
self.is_boolean()):
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in
pandas._libs.hashtable.Int64HashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in
pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: 34
How can I fix the error?
Also, how can I get the result back into my data frame, i.e. add another column with the results?
Upvotes: 0
Views: 1536
Reputation: 56
Use sentence.iloc[i] instead of sentence[i].
The KeyError means that there is no label 34 in df.index.
sentence is a pandas Series. When you access sentence[i], pandas first tries label-based indexing (like sentence.loc[i]) and only falls back to position-based indexing (like sentence.iloc[i]) if the index is non-numeric. So this code might work if your index happens to be non-numeric, but with an integer index it looks up the label i, not the i-th row. Recent pandas versions also deprecate that fallback, which is another reason to be explicit. You can fix the error by using position-based indexing directly (sentence.iloc[i]).
For a self-contained example:
import pandas as pd

# The index labels are 10 and 20, so there is no label 0 or 1.
df = pd.DataFrame({'index': [10, 20], 'tweets': [['hello world'], ['foo bar']]}).set_index('index')
sentence = df['tweets']
for i in range(len(sentence)):
    print(sentence[i])  # label lookup: raises KeyError: 0

The same loop works with .iloc:

import pandas as pd

df = pd.DataFrame({'index': [10, 20], 'tweets': [['hello world'], ['foo bar']]}).set_index('index')
sentence = df['tweets']
for i in range(len(sentence)):
    print(sentence.iloc[i])  # positional lookup: prints each row in order
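The question does not show how df was built, so this is only a guess at the cause: a missing label such as 34 often appears after rows have been filtered or dropped, because pandas keeps the original labels. The sketch below uses made-up data to show the gap and two ways around it besides .iloc: renumbering with reset_index(drop=True), or iterating with Series.items() so no integer lookup is needed.

```python
import pandas as pd

# Hypothetical data: five tweets, two of them empty and filtered out.
df = pd.DataFrame({'tweet': ['a b', '', 'c d', '', 'e f']})
df = df[df['tweet'] != '']  # surviving labels are 0, 2, 4; labels 1 and 3 are gone

# Option 1: renumber the rows so labels and positions agree again.
df_reset = df.reset_index(drop=True)  # labels are now 0, 1, 2
by_label = [df_reset['tweet'][i] for i in range(len(df_reset))]

# Option 2: iterate over (label, value) pairs and skip integer lookups entirely.
by_items = [value for _, value in df['tweet'].items()]

print(by_label)
print(by_items)
```

Either option leaves the original labels out of the loop logic, so the loop no longer depends on the index being contiguous.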
A tip: rather than iterating manually through rows in a DataFrame, it's usually safer and more performant to write your logic as a function and use df.apply.
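As a minimal sketch of that tip, which also covers the "add another column" part of the question: put the cleaning logic in a function and assign the result of apply to a new column. The names clean_tweet, toy_lemmatize, STOP_WORDS and the column clean_tweet are hypothetical, and the tiny stop-word set and toy lemmatizer are stand-ins so the example runs without downloading NLTK data. With NLTK you would pass set(stopwords.words('english')) and WordNetLemmatizer().lemmatize instead.

```python
import pandas as pd

# Hypothetical stand-in for set(stopwords.words('english')).
STOP_WORDS = {'a', 'the', 'is', 'we', 'in'}

def toy_lemmatize(word):
    """Placeholder for WordNetLemmatizer().lemmatize: strips a trailing 's'."""
    return word[:-1] if len(word) > 3 and word.endswith('s') else word

def clean_tweet(text, stop_words=STOP_WORDS, lemmatize=toy_lemmatize):
    """Lowercase the text, drop stop words, and lemmatize the remaining tokens."""
    tokens = text.lower().split()  # nltk.word_tokenize(text) could be used instead
    return ' '.join(lemmatize(t) for t in tokens if t not in stop_words)

# Non-contiguous index labels on purpose: apply does not care about them.
df = pd.DataFrame(
    {'tweet': ['We wish a happy year', 'The years in the country']},
    index=[10, 34],
)

# apply runs the function on every value and keeps the original index,
# so the result lines up with df and can be assigned as a new column.
df['clean_tweet'] = df['tweet'].apply(clean_tweet)
print(df)
```

Because apply preserves the index, there is no integer lookup at all, so the KeyError cannot occur regardless of which labels the data frame has.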
Upvotes: 1