Oblomov

Reputation: 9665

Get the word frequency over all rows from a column containing texts

Based on the (simplified) DataFrame

import pandas as pd
texts = pd.DataFrame({"description":["This is one text","and this is another one"]})
print(texts)
               description
0         This is one text
1  and this is another one

I want to create a Series with the frequency of each word across all rows of the description column.

The expected result should look as follows:

           counts
this       2
is         2
one        2
text       1
and        1
another    1

I tried

print(pd.Series('  '.join(str(texts.description)).split(' ')).value_counts())

but got

      139
e       8
t       7
i       6
n       5
o       5
s       5
d       3
a       3
h       3
p       2
:       2
c       2
r       2
\n      2
T       1
0       1
j       1
x       1
1       1
N       1
m       1
,       1
y       1
b       1
dtype: int64

Upvotes: 1

Views: 53

Answers (4)

jezrael

Reputation: 863711

If you want to convert the values of the column to strings, use the Series.astype function:

print(pd.Series(' '.join(texts.description.astype(str)).split(' ')).value_counts())

But if all values in the column are already strings, you can omit it and it works fine:

print(pd.Series(' '.join(texts.description).split(' ')).value_counts())
one        2
is         2
This       1
text       1
this       1
and        1
another    1
dtype: int64
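The output above counts "This" and "this" separately, while the expected result in the question merges them. A minimal sketch, assuming case-insensitive counting is what is wanted, lowercases the joined text before splitting (the names words and counts are only illustrative):

```python
import pandas as pd

texts = pd.DataFrame({"description": ["This is one text", "and this is another one"]})

# Join all rows into one string, lowercase it so "This" and "this"
# fall into the same bucket, then split on single spaces.
words = " ".join(texts.description).lower().split(" ")
counts = pd.Series(words).value_counts()
print(counts)
```

The order of words with equal counts may differ between pandas versions.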

Upvotes: 0

Quang Hoang

Reputation: 150815

Your code failed because str(texts.description) gives:

'0           This is one text\n1    and this is another one\nName: description, dtype: object'

That is the string representation of the Series, essentially what print(texts.description) displays. When you call join(str(texts.description)), join iterates over that string one character at a time, so you end up counting characters instead of words.
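To see this in isolation, here is a small sketch. The variables rendered and demo are only illustrative, and the short string "abc" stands in for the much longer rendered Series:

```python
import pandas as pd

texts = pd.DataFrame({"description": ["This is one text", "and this is another one"]})

# str() on a Series returns one string: the printed form of the whole Series,
# including the index and the trailing "Name: ..." line.
rendered = str(texts.description)
print(type(rendered))

# str.join iterates over its argument; iterating a string yields characters.
demo = "  ".join("abc")
print(repr(demo))  # 'a  b  c'
```

Joining the rendered string the same way puts two spaces between every character, which is why the attempt in the question ends up counting single characters and empty strings.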

Try:

(texts.description
      .str.lower()
      .str.split(expand=True)
      .stack().value_counts()
)

Output:

this       2
one        2
is         2
another    1
and        1
text       1
dtype: int64
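As a possible variation on the chain above, Series.explode (available since pandas 0.25) can replace the expand=True / stack() pair. This is a sketch of an alternative, not part of the original answer:

```python
import pandas as pd

texts = pd.DataFrame({"description": ["This is one text", "and this is another one"]})

counts = (
    texts.description
    .str.lower()     # make the count case-insensitive
    .str.split()     # each row becomes a list of words
    .explode()       # one row per word
    .value_counts()
)
print(counts)
```

This avoids building a wide intermediate DataFrame padded with missing values when the rows contain different numbers of words.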

Upvotes: 1

mujjiga

Reputation: 16916

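from collections import Counter
# Lowercase and split each description into a list of words, then count every word.
l = texts['description'].apply(lambda x: x.lower().split())
Counter([item for sublist in l for item in sublist])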
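Since the question asks for a Series, the Counter can be wrapped in one. The following is a self-contained sketch of that idea; the names word_counts and counts are illustrative and not from the original answer:

```python
from collections import Counter

import pandas as pd

texts = pd.DataFrame({"description": ["This is one text", "and this is another one"]})

# Count lowercased words across all rows with a generator expression.
word_counts = Counter(
    word
    for description in texts["description"]
    for word in description.lower().split()
)

# Counter is a dict subclass, so pandas can build a Series from it directly.
counts = pd.Series(word_counts).sort_values(ascending=False)
print(counts)
```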

Upvotes: 1

skywalker

Reputation: 1300

Remove the str in print(pd.Series(' '.join(str(texts.description)).split(' ')).value_counts())

This is because str(texts.description) returns '0 This is one text\n1 and this is another one\nName: description, dtype: object' and that is not what you want.

It works like this:

print(pd.Series('  '.join(texts.description).split(' ')).value_counts())

And gives you:

is         2
one        2
This       1
and        1
this       1
another    1
text       1
           1
dtype: int64
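The blank row with count 1 in the output above comes from joining with two spaces and then splitting on a single space, which leaves an empty string between the two rows. A minimal sketch of one way to avoid it, assuming whitespace-separated words, is to call split() with no argument (the variable names are illustrative):

```python
import pandas as pd

texts = pd.DataFrame({"description": ["This is one text", "and this is another one"]})

joined = "  ".join(texts.description)   # two-space separator, as in the snippet above

with_empty = joined.split(" ")          # splitting on one space keeps an empty token
without_empty = joined.split()          # no argument: splits on any run of whitespace

counts = pd.Series(without_empty).value_counts()
print(counts)
```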

Upvotes: 0
