Reputation: 1739
I am trying to remove stopwords from tweets that I have imported from Twitter. After removing the stopwords, the resulting list of strings should be placed in a new column of the same row. I can easily accomplish this one row at a time, but my attempt to loop the method over the whole DataFrame does not succeed.
How would I do this?
Snippet of my data:
tweets['text'][0:5]
Out[21]:
0 Why #litecoin will go over 50 USD soon ? So ma...
1 get 20 free #bitcoin spins at...
2 Are you Bullish or Bearish on #BMW? Start #Tra...
3 Are you Bullish or Bearish on the S&P 500?...
4 TIL that there is a DAO ExtraBalance Refund. M...
The following works in a single row scenario:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
tweets['text-filtered'] = ""

word_tokens = word_tokenize(tweets['text'][1])
filtered_sentence = [w for w in word_tokens if not w in stop_words]
tweets['text-filtered'][1] = filtered_sentence
tweets['text-filtered'][1]
Out[22]:
['get',
'20',
'free',
'#',
'bitcoin',
'spins',
'withdraw',
'free',
'#',
'btc',
'#',
'freespins',
'#',
'nodeposit',
'#',
'casino',
'#',
'...',
':']
My attempt at a loop does not succeed:
for i in tweets:
    word_tokens = word_tokenize(tweets.get(tweets['text'][i], False))
    filtered_sentence = [w for w in word_tokens if not w in stop_words]
    tweets['text-filtered'][i] = filtered_sentence
A snippet of the traceback:
Traceback (most recent call last):
File "<ipython-input-23-6d7dace7a2d0>", line 2, in <module>
word_tokens = word_tokenize(tweets.get(tweets['text'][i], False))
...
KeyError: 'id'
Based on @Prune's reply, I have managed to correct my mistake. Here is a potential solution:
count = 0
for i in tweets['text']:
    word_tokens = word_tokenize(i)
    filtered_sentence = [w for w in word_tokens if not w in stop_words]
    tweets['text-filtered'][count] = filtered_sentence
    count += 1
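As an alternative to the manual counter, the same filtering can be done column-wise with DataFrame.apply. This is a minimal, self-contained sketch: it uses a tiny hard-coded stop-word set and plain str.split() in place of NLTK's stopwords corpus and word_tokenize (which need a corpus download), and the sample rows are made up to mirror the data above.

```python
import pandas as pd

# Assumed stand-ins for NLTK's stopwords/word_tokenize, to keep this runnable
stop_words = {"are", "you", "or", "on", "the"}

tweets = pd.DataFrame({"text": [
    "Are you Bullish or Bearish on the S&P 500",
    "get 20 free #bitcoin spins",
]})

# Build the filtered column in one pass, row by row
tweets["text-filtered"] = tweets["text"].apply(
    lambda t: [w for w in t.split() if w.lower() not in stop_words]
)

print(tweets["text-filtered"][0])  # ['Bullish', 'Bearish', 'S&P', '500']
```

This also avoids the chained indexing (tweets['text-filtered'][count] = ...) that pandas warns about, since the whole column is assigned at once.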
My previous attempt was looping through the columns of the DataFrame tweets, and the first column in tweets was "id".
tweets.columns
Out[30]:
Index(['id', 'user_bg_color', 'created', 'geo', 'user_created', 'text',
'polarity', 'user_followers', 'user_location', 'retweet_count',
'id_str', 'user_name', 'subjectivity', 'coordinates',
'user_description', 'text-filtered'],
dtype='object')
Upvotes: 0
Views: 3608
Reputation: 77860
You're confused about list indexing:
for i in tweets:
    word_tokens = word_tokenize(tweets.get(tweets['text'][i], False))
    filtered_sentence = [w for w in word_tokens if not w in stop_words]
    tweets['text-filtered'][i] = filtered_sentence
Note that tweets is a dictionary; tweets['text'] is a list of strings. Thus, for i in tweets returns all of the keys in tweets: the dictionary keys, in arbitrary order. It appears that "id" is the first one returned. When you try to assign tweets['text-filtered']['id'] = filtered_sentence, there just is no such element.
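The iteration behavior can be seen with a tiny made-up DataFrame (the column names here are just for illustration):

```python
import pandas as pd

# Iterating over a DataFrame yields its column labels, not its rows --
# which is why the original loop saw 'id' first.
tweets = pd.DataFrame({"id": [1, 2], "text": ["a b", "c d"]})

seen = [col for col in tweets]
print(seen)  # ['id', 'text']
```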
Try coding more gently: start at the inside, code a few lines at a time, and work your way up to more complex control structures. Debug each addition before you go on. Here, you seem to have lost your sense of what is a numeric index, what is a list, and what is a dictionary.
Since you haven't done any visible debugging, or provided the context, I can't fix the whole program for you -- but this should get you started.
Upvotes: 2