Styuioplkj
Styuioplkj

Reputation: 3

Return the list of each word in a pandas cell and the total count of that word in the entire column

I have a pandas data frame, df which looks like this:

             column1
0   apple is a fruit
1        fruit sucks
2  apple tasty fruit
3   fruits what else
4      yup apple map
5   fire in the hole
6       that is true

I want to produce a column2, which is the list of each word in the row and the total count of each word in the entire column. So the output would be something like this....

    column1            column2
0   apple is a fruit   [('apple', 3),('is', 2),('a', 1),('fruit', 3)]
1        fruit sucks   [('fruit', 3),('sucks', 1)]

I tried using the sklearn, but failing to achieve the above. Need help.

from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()
x = v.fit_transform(df['text'])

Upvotes: 0

Views: 1191

Answers (2)

Alex Olteanu
Alex Olteanu

Reputation: 4248

I don't know if you can do that using scikit-learn, but you can write a function and then use apply() to apply it on your DataFrame or Series.

Here's how you could go about it for your example:

test = pd.DataFrame(['apple is a fruit', 'fruit sucks', 'apple tasty fruit'], columns = ['A'])

def a_function(row):
    splitted_row = str(row.values[0]).split()
    word_occurences = []
    for word in splitted_row:
        column_occurences = test.A.str.count(word).sum()
        word_occurences.append((word, column_occurences))
    return word_occurences

test.apply(a_function, axis = 1)

# Output
0    [(apple, 2), (is, 1), (a, 4), (fruit, 3)]
1                     [(fruit, 3), (sucks, 1)]
2         [(apple, 2), (tasty, 1), (fruit, 3)]
dtype: object

As you can see, the main problem is that test.A.str.count(word) will count all the occurrences of word, wherever the pattern assigned to word is within the string. That's why "a" is displayed as occuring 4 times. This should probably be fixed easily with some regexes (which I'm not very good at).

Or you can use this workaround inside the function above if you are willing to lose some words:

if word not in ['a', 'is']:  # you can add here more useless words
    word_occurences.append((word, column_occurences))

Upvotes: -1

erasmortg
erasmortg

Reputation: 3278

Here is one way that gives the result you want, although avoids sklearn entirely:

def counts(data, column):
    full_list = []
    datr = data[column].tolist()
    total_words = " ".join(datr).split(' ')
    # per rows
    for i in range(len(datr)):
        #first per row get the words
        word_list = re.sub("[^\w]", " ",  datr[i]).split()
        #cycle per word
        total_row = []
        for word in word_list:
            count = []
            count = total_words.count(word)
            val = (word, count)
            total_row.append(val)
        full_list.append(total_row)
    return full_list

df['column2'] = counts(df,'column1')
df
         column1                                    column2
0   apple is a fruit  [(apple, 3), (is, 2), (a, 1), (fruit, 3)]
1        fruit sucks                   [(fruit, 3), (sucks, 1)]
2  apple tasty fruit       [(apple, 3), (tasty, 1), (fruit, 3)]
3   fruits what else        [(fruits, 1), (what, 1), (else, 1)]
4      yup apple map           [(yup, 1), (apple, 3), (map, 1)]
5   fire in the hole  [(fire, 1), (in, 1), (the, 1), (hole, 1)]
6       that is true            [(that, 1), (is, 2), (true, 1)]

Upvotes: 2

Related Questions