Reputation: 83

Iterating over a text column in a dataframe

Hi all. I am working on a dataframe (picture above) with over 18000 observations. I'd like to take the text in the column 'review' row by row and later do a word count on it. So far I have been trying to iterate over it, but I keep getting an error like "TypeError: 'float' object is not iterable". Here is the code I used:

def tokenize(text):
    for row in text:
        for i in row:
            if i is not None:
                words = i.lower().split()
                return words
            else:
                return None

data['review_two'] = data['review'].apply(tokenize)

Now my question is: how do I iterate effectively and efficiently over the column 'review' so that I can preprocess each row one after the other before performing a word count on it?

Upvotes: 1

Answers (3)

Tai

Reputation: 7994

My hypothesis is that the error comes from missing data: a missing value is NaN, which is a float, and that makes the tokenize function fail. You can check this with pd.isnull(df["review"]), which returns a boolean array showing whether each row is NaN. If any(pd.isnull(df["review"])) is true, then there is a missing value in the column.
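A minimal sketch of that check, assuming a small made-up dataframe in place of the real data (the sample reviews and the names missing, dropped and filled are only illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data; the second review is missing.
df = pd.DataFrame({"review": ["No it is not good.", np.nan, "This is fine."]})

# Boolean mask: True where the review is missing
missing = df["review"].isnull()
print(missing.any())  # is there at least one missing value?
print(missing.sum())  # how many rows are affected?

# Two common ways to deal with it: drop the rows, or use an empty string
dropped = df.dropna(subset=["review"])
filled = df["review"].fillna("")
```

Which of the two options is appropriate depends on whether rows without a review should still show up in later results.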

I cannot reproduce the error as I don't have the data, but I think your goal can be achieved with this.

from collections import Counter
import pandas as pd
df = pd.DataFrame([{"name": "A", "review": "No it is not good.", "rating":2},
                {"name": "B", "review": "Awesome!", "rating":5},
                 {"name": "C", "review": "This is fine.", "rating":3},
                 {"name": "C", "review": "This is fine.", "rating":3}])

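# first .lower, then .replace to strip punctuation, and finally .split to get lists
# (regex=True is needed in pandas 2.0+, where str.replace matches literally by default)
df["splitted"] = df.review.str.lower().str.replace(r'[^\w\s]', '', regex=True).str.split()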

# pass a counter to count every list. Then sum counters. (Counters can be added.)
df["splitted"].transform(lambda x: Counter(x)).sum()

Counter({'awesome': 1,
     'fine': 2,
     'good': 1,
     'is': 3,
     'it': 1,
     'no': 1,
     'not': 1,
     'this': 2})

The str.replace part removes punctuation; see the answer to "Replacing punctuation in a data frame based on punctuation list" from @EdChum.
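If summing a Series of Counter objects feels fragile, the counting step can also be sketched with a plain loop and Counter.update. This is an alternative, not the approach shown above; the names tokens and total are only illustrative, and dropna() is there on the assumption that the real column has missing reviews.

```python
from collections import Counter
import pandas as pd

# Same sample reviews as in the example above
df = pd.DataFrame({"review": ["No it is not good.", "Awesome!",
                              "This is fine.", "This is fine."]})

# Lowercase, strip punctuation, split into word lists
tokens = (df["review"].str.lower()
          .str.replace(r"[^\w\s]", "", regex=True)
          .str.split())

# Accumulate one Counter over all rows; missing reviews are skipped
total = Counter()
for words in tokens.dropna():
    total.update(words)

print(total.most_common(3))
```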

Upvotes: 1

SamuelNLP

Reputation: 4136

Maybe something like this, which gives you the word count per row. I did not understand the rest of what you want.

import pandas as pd

a = ['hello friend', 'a b c d']
b = pd.DataFrame(a)

print(b[0].str.split().str.len())

>> 0    2
   1    4

Upvotes: 0

nnnmmm

Reputation: 8754

I'm not sure what you're trying to do, especially with for i in row. In any case, apply already iterates over the rows of your DataFrame/Series, so there's no need to do it in the function that you pass to apply.
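As a rough sketch of that point, the function passed to apply can handle a single cell and nothing more. The sample data below is made up, and returning None for a missing review is an assumption about what you want to happen in that case:

```python
import pandas as pd

def tokenize(text):
    """Lowercase and split one review; return None if it is missing."""
    if pd.isnull(text):
        return None
    return text.lower().split()

# Hypothetical sample data with one missing review
data = pd.DataFrame({"review": ["No it is not good.", None, "Awesome!"]})

# apply calls tokenize once per cell, so no loop is needed inside it
data["review_two"] = data["review"].apply(tokenize)
print(data)
```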

Besides, your code does not raise a TypeError for a DataFrame such as yours where the columns contain strings. See here for how to check if your 'review' column contains only text.

Upvotes: 0
