Reputation: 83
Hi all. I am working on a dataframe (picture above) with over 18000 observations. What I'd like to do is take the text in the column 'review' row by row and later do a word count on it. At the moment I am trying to iterate over it, but I keep getting an error like "TypeError: 'float' object is not iterable". Here is the code I used:
def tokenize(text):
    for row in text:
        for i in row:
            if i is not None:
                words = i.lower().split()
                return words
            else:
                return None
data['review_two'] = data['review'].apply(tokenize)
Now my question is: how do I iterate effectively and efficiently over the column 'review' so that I can preprocess each row in turn before performing a word count on it?
Upvotes: 1
Views: 1525
Reputation: 7994
My hypothesis for the error is that you have missing data, which is NaN (a float) and makes the tokenize function fail. You can check it with pd.isnull(df["review"]), which returns a boolean Series showing whether each row is NaN. If any(pd.isnull(df["review"])) is True, then there is a missing value in the column.
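As a minimal sketch of that check (the three-row frame below is made-up data, not the asker's, and the fillna step is just one possible way to handle the gaps):

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second review is missing (NaN is a float).
df = pd.DataFrame({"review": ["No it is not good.", np.nan, "This is fine."]})

missing = df["review"].isnull()    # boolean Series, True where the review is NaN
has_missing = bool(missing.any())  # True if at least one review is missing
n_missing = int(missing.sum())     # number of missing reviews

print(has_missing, n_missing)

# One option (an assumption, not the only fix): treat missing reviews as empty text.
cleaned = df["review"].fillna("")
```

Dropping the rows with df.dropna(subset=["review"]) would be an alternative if empty reviews should not be kept at all.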
I cannot reproduce the error as I don't have the data, but I think your goal can be achieved with this.
import pandas as pd
from collections import Counter
df = pd.DataFrame([{"name": "A", "review": "No it is not good.", "rating":2},
{"name": "B", "review": "Awesome!", "rating":5},
{"name": "C", "review": "This is fine.", "rating":3},
{"name": "C", "review": "This is fine.", "rating":3}])
# first .lower, then .replace to strip punctuation, and finally .split to get lists
# (regex=True makes pandas treat the pattern as a regular expression)
df["splitted"] = df.review.str.lower().str.replace(r'[^\w\s]', '', regex=True).str.split()
# pass a counter to count every list. Then sum counters. (Counters can be added.)
df["splitted"].transform(lambda x: Counter(x)).sum()
Counter({'awesome': 1,
'fine': 2,
'good': 1,
'is': 3,
'it': 1,
'no': 1,
'not': 1,
'this': 2})
The str.replace part removes punctuation; see the answer "Replacing punctuation in a data frame based on punctuation list" from @EdChum.
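If the column does contain NaN, the .str methods pass it through unchanged, and Counter(NaN) then fails with the same "'float' object is not iterable" error. A hedged variant of the approach above drops missing reviews first and accumulates into a single Counter with an explicit loop (a swapped-in alternative to summing Counters; the sample data is made up):

```python
from collections import Counter

import numpy as np
import pandas as pd

# Hypothetical data with one missing review.
df = pd.DataFrame(
    {"review": ["No it is not good.", np.nan, "This is fine.", "This is fine."]}
)

tokens = (
    df["review"]
    .dropna()                                  # skip missing reviews
    .str.lower()                               # normalise case
    .str.replace(r"[^\w\s]", "", regex=True)   # strip punctuation
    .str.split()                               # one list of words per review
)

# Accumulate word counts across all reviews.
counts = Counter()
for words in tokens:
    counts.update(words)

print(counts)
```

Updating one Counter in place avoids creating an intermediate Counter per row, which may matter with 18000+ reviews.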
Upvotes: 1
Reputation: 4136
Maybe something like this, which gives you the word count per row. I did not understand the rest of what you want.
import pandas as pd
a = ['hello friend', 'a b c d']
b = pd.DataFrame(a)
print(b[0].str.split().str.len())
>> 0    2
   1    4
Upvotes: 0
Reputation: 8754
I'm not sure what you're trying to do, especially with for i in row. In any case, apply already iterates over the rows of your DataFrame/Series, so there's no need to do it in the function that you pass to apply.
Besides, your code does not raise a TypeError for a DataFrame such as yours where the column contains only strings. See here for how to check if your 'review' column contains only text.
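To illustrate the point about apply, here is a minimal sketch of a tokenize function that handles one cell at a time. The sample frame is hypothetical, and returning an empty list for non-string values such as NaN is an assumption about what the asker wants:

```python
import numpy as np
import pandas as pd

def tokenize(text):
    """Lowercase and split a single review; return [] for non-strings such as NaN."""
    if isinstance(text, str):
        return text.lower().split()
    return []

# Hypothetical data standing in for the 'review' column.
data = pd.DataFrame({"review": ["No it is not good.", np.nan, "Awesome!"]})

# apply calls tokenize once per cell, so the function needs no loop of its own.
data["review_two"] = data["review"].apply(tokenize)
print(data["review_two"].tolist())
```

With the original function, a NaN cell reaches for row in text as a float, which is what raises "'float' object is not iterable".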
Upvotes: 0