Reputation: 3
I have a pandas data frame, df which looks like this:
column1
0 apple is a fruit
1 fruit sucks
2 apple tasty fruit
3 fruits what else
4 yup apple map
5 fire in the hole
6 that is true
I want to produce a column2, which is the list of each word in the row and the total count of each word in the entire column. So the output would be something like this....
column1 column2
0 apple is a fruit [('apple', 3),('is', 2),('a', 1),('fruit', 3)]
1 fruit sucks [('fruit', 3),('sucks', 1)]
I tried using the sklearn, but failing to achieve the above. Need help.
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()
x = v.fit_transform(df['text'])
Upvotes: 0
Views: 1191
Reputation: 4248
I don't know if you can do that using scikit-learn
, but you can write a function and then use apply()
to apply it on your DataFrame
or Series
.
Here's how you could go about it for your example:
test = pd.DataFrame(['apple is a fruit', 'fruit sucks', 'apple tasty fruit'], columns = ['A'])
def a_function(row):
splitted_row = str(row.values[0]).split()
word_occurences = []
for word in splitted_row:
column_occurences = test.A.str.count(word).sum()
word_occurences.append((word, column_occurences))
return word_occurences
test.apply(a_function, axis = 1)
# Output
0 [(apple, 2), (is, 1), (a, 4), (fruit, 3)]
1 [(fruit, 3), (sucks, 1)]
2 [(apple, 2), (tasty, 1), (fruit, 3)]
dtype: object
As you can see, the main problem is that test.A.str.count(word)
will count all the occurrences of word
, wherever the pattern assigned to word
is within the string. That's why "a"
is displayed as occuring 4 times. This should probably be fixed easily with some regexes (which I'm not very good at).
Or you can use this workaround inside the function above if you are willing to lose some words:
if word not in ['a', 'is']: # you can add here more useless words
word_occurences.append((word, column_occurences))
Upvotes: -1
Reputation: 3278
Here is one way that gives the result you want, although avoids sklearn
entirely:
def counts(data, column):
full_list = []
datr = data[column].tolist()
total_words = " ".join(datr).split(' ')
# per rows
for i in range(len(datr)):
#first per row get the words
word_list = re.sub("[^\w]", " ", datr[i]).split()
#cycle per word
total_row = []
for word in word_list:
count = []
count = total_words.count(word)
val = (word, count)
total_row.append(val)
full_list.append(total_row)
return full_list
df['column2'] = counts(df,'column1')
df
column1 column2
0 apple is a fruit [(apple, 3), (is, 2), (a, 1), (fruit, 3)]
1 fruit sucks [(fruit, 3), (sucks, 1)]
2 apple tasty fruit [(apple, 3), (tasty, 1), (fruit, 3)]
3 fruits what else [(fruits, 1), (what, 1), (else, 1)]
4 yup apple map [(yup, 1), (apple, 3), (map, 1)]
5 fire in the hole [(fire, 1), (in, 1), (the, 1), (hole, 1)]
6 that is true [(that, 1), (is, 2), (true, 1)]
Upvotes: 2