Return the list of each word in a pandas cell and the total count of that word in the entire column

Question

I have a pandas DataFrame, df, which looks like this:

             column1
0   apple is a fruit
1        fruit sucks
2  apple tasty fruit
3   fruits what else
4      yup apple map
5   fire in the hole
6       that is true

I want to produce a column2 that holds, for each row, a list of the words in that row paired with the total count of each word across the entire column. The output would look something like this:

    column1            column2
0   apple is a fruit   [('apple', 3),('is', 2),('a', 1),('fruit', 3)]
1        fruit sucks   [('fruit', 3),('sucks', 1)]

I tried using sklearn, but could not get the output above. Any help would be appreciated.

from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()
x = v.fit_transform(df['column1'])

erasmortg · Accepted Answer

Here is one way to get the result you want, although it avoids sklearn entirely:

import re

def counts(data, column):
    full_list = []
    datr = data[column].tolist()
    # every word in the column, split with the same rule used per row below
    total_words = re.sub(r"[^\w]", " ", " ".join(datr)).split()
    # per row
    for i in range(len(datr)):
        # first get the words in this row
        word_list = re.sub(r"[^\w]", " ", datr[i]).split()
        # then pair each word with its count over the whole column
        total_row = []
        for word in word_list:
            count = total_words.count(word)
            val = (word, count)
            total_row.append(val)
        full_list.append(total_row)
    return full_list

df['column2'] = counts(df,'column1')
df
             column1                                    column2
0   apple is a fruit  [(apple, 3), (is, 2), (a, 1), (fruit, 3)]
1        fruit sucks                   [(fruit, 3), (sucks, 1)]
2  apple tasty fruit       [(apple, 3), (tasty, 1), (fruit, 3)]
3   fruits what else        [(fruits, 1), (what, 1), (else, 1)]
4      yup apple map           [(yup, 1), (apple, 3), (map, 1)]
5   fire in the hole  [(fire, 1), (in, 1), (the, 1), (hole, 1)]
6       that is true            [(that, 1), (is, 2), (true, 1)]
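
The function above calls list.count once per word, so it rescans the whole column for every word. For a larger column, a variant that counts everything once with collections.Counter may be faster. This is only a sketch, not part of the original answer: the name word_counts is made up for illustration, and it takes a plain list of strings so that it runs without pandas.

```python
import re
from collections import Counter

def word_counts(rows):
    """Pair every word in each row with its count across all rows."""
    # Tokenize each row once, using the same rule as above (runs of \w characters).
    tokenized = [re.findall(r"\w+", text) for text in rows]
    # Count every word over the whole column in a single pass.
    totals = Counter(word for words in tokenized for word in words)
    # Rebuild each row as a list of (word, total count) tuples.
    return [[(word, totals[word]) for word in words] for words in tokenized]

# The sample data from the question, as a plain list.
rows = [
    "apple is a fruit",
    "fruit sucks",
    "apple tasty fruit",
    "fruits what else",
    "yup apple map",
    "fire in the hole",
    "that is true",
]
result = word_counts(rows)
print(result[0])  # [('apple', 3), ('is', 2), ('a', 1), ('fruit', 3)]
```

With the data frame from the question, the call would be df['column2'] = word_counts(df['column1'].tolist()).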
