Reputation: 21
Grateful for your help with what feels like a stupid question. I've pulled a SQLite table into a pandas DataFrame so I can tokenize and count the frequency of words from a series of tweets.
With the code below, I can produce word counts for the first tweet. How do I iterate over the whole table?
import sqlite3
import nltk
import pandas as pd
from nltk.tokenize import RegexpTokenizer

conn = sqlite3.connect("tweets.sqlite")
data = pd.read_sql_query("select tweet_text from tweets_new;", conn)

tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(data['tweet_text'][0])
words = nltk.FreqDist(tokens)

unigram_df = pd.DataFrame(words.most_common(),
                          columns=["WORD", "COUNT"])
unigram_df
When I change the index to anything other than a single row, I get the following error:
TypeError: expected string or buffer
I know there are other ways of doing this, but I need to do it along these lines because of how I intend to use the output next. Thanks for any help you can provide!
I have tried:
%%time
tokenizer = RegexpTokenizer(r'\w+')
print "Cleaning the tweets...\n"
for i in xrange(0, len(df)):
    if (i+1) % 1000000 == 0:
        tokens = tokenizer.tokenize(df['tweet_text'][i])
        words = nltk.FreqDist(tokens)
This looks like it should work, but still only returns words from the first row.
Upvotes: 1
Views: 1672
Reputation: 21
In case anyone is interested in this niche use case, here's the code I was eventually able to make work:
conn = sqlite3.connect("tweets.sqlite")
data = pd.read_sql_query("select tweet_text from tweets_new;", conn)

# Collapse the whole result into a single string, then tokenize it all at once
alldata = str(data)

tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(alldata)
words = nltk.FreqDist(tokens)

unigram_df = pd.DataFrame(words.most_common(),
                          columns=["WORD", "COUNT"])
Thanks for your help everyone!
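One caveat: str(data) goes through pandas' string representation, which truncates long DataFrames and includes the index and column name, so on a big table some tweets can be silently dropped and stray tokens (row numbers, 'tweet_text') added. A minimal sketch of a safer variant, assuming every tweet_text value is a string:

# Join every tweet into one string instead of relying on str(data),
# whose repr truncates long DataFrames and leaks the index into the tokens
alldata = " ".join(data['tweet_text'])
tokens = tokenizer.tokenize(alldata)
words = nltk.FreqDist(tokens)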
Upvotes: 0
Reputation: 4487
I think your problem can be solved more concisely using CountVectorizer. I'll give you an example. Given the following inputs:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
corpus_tweets = [['I love pizza and hamburger'], ['I love apple and chips'], ['The pen is on the table!!']]
df = pd.DataFrame(corpus_tweets, columns=['tweet_text'])
You can create your bag of words template with these few lines:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(df.tweet_text)
You can print the obtained vocabulary:
count_vect.vocabulary_
# output: {'love': 5, 'pizza': 8, 'and': 0, 'hamburger': 3, 'apple': 1, 'chips': 2, 'the': 10, 'pen': 7, 'is': 4, 'on': 6, 'table': 9}
and get the dataframe with word counts:
df_count = pd.DataFrame(X_train_counts.todense(), columns=count_vect.get_feature_names())
   and  apple  chips  hamburger  is  love  on  pen  pizza  table  the
0    1      0      0          1   0     1   0    0      1      0    0
1    1      1      1          0   0     1   0    0      0      0    0
2    0      0      0          0   1     0   1    1      0      1    2
If it is useful for you, you can merge the dataframe of the counts with the dataframe of the corpus:
pd.concat([df, df_count], axis=1)
                   tweet_text  and  apple  chips  hamburger  is  love  on  \
0  I love pizza and hamburger    1      0      0          1   0     1   0
1      I love apple and chips    1      1      1          0   0     1   0
2   The pen is on the table!!    0      0      0          0   1     0   1

   pen  pizza  table  the
0    0      1      0    0
1    0      0      0    0
2    1      0      1    2
If you want to get the dictionary containing the <word, count>
pairs for each document, at this point all you need to do is:
dict_count = df_count.T.to_dict()
{0: {'and': 1,
'apple': 0,
'chips': 0,
'hamburger': 1,
'is': 0,
'love': 1,
'on': 0,
'pen': 0,
'pizza': 1,
'table': 0,
'the': 0},
1: {'and': 1,
'apple': 1,
'chips': 1,
'hamburger': 0,
'is': 0,
'love': 1,
'on': 0,
'pen': 0,
'pizza': 0,
'table': 0,
'the': 0},
2: {'and': 0,
'apple': 0,
'chips': 0,
'hamburger': 0,
'is': 1,
'love': 0,
'on': 1,
'pen': 1,
'pizza': 0,
'table': 1,
'the': 2}}
Note: turning X_train_counts, which is a SciPy sparse matrix, into a dense DataFrame is not a good idea for large corpora. But it can be useful for understanding and visualizing the various steps of your model.
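If all you need are corpus-wide totals, you can compute them on the sparse matrix directly instead of densifying it. A minimal sketch, reusing the count_vect and X_train_counts objects from above:

import numpy as np

# Sum counts over all documents; the result stays small (one number per word)
totals = np.asarray(X_train_counts.sum(axis=0)).ravel()
word_totals = dict(zip(count_vect.get_feature_names(), totals))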
Upvotes: 1
Reputation: 1217
After creating the DataFrame, loop over all the rows:
from nltk import FreqDist
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
fdist = FreqDist()

# Accumulate lower-cased word counts across every tweet
for txt in data['tweet_text']:
    for word in tokenizer.tokenize(txt):
        fdist[word.lower()] += 1
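If you then want the same WORD/COUNT DataFrame as in the question, FreqDist supports most_common() just like a Counter:

unigram_df = pd.DataFrame(fdist.most_common(),
                          columns=["WORD", "COUNT"])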
Upvotes: 0