Reputation: 9665
Based on the (simplified) DataFrame
import pandas as pd
texts = pd.DataFrame({"description":["This is one text","and this is another one"]})
print(texts)
description
0 This is one text
1 and this is another one
I want to create a Series holding the frequency of each word that appears in the description column.
The expected result should look as follows:
counts
this 2
is 2
one 2
text 1
and 1
another 1
I tried
print(pd.Series(' '.join(str(texts.description)).split(' ')).value_counts())
but got
139
e 8
t 7
i 6
n 5
o 5
s 5
d 3
a 3
h 3
p 2
: 2
c 2
r 2
\n 2
T 1
0 1
j 1
x 1
1 1
N 1
m 1
, 1
y 1
b 1
dtype: int64
Upvotes: 1
Views: 53
Reputation: 863711
If you want to convert the column values to strings, use the Series.astype function:
print(pd.Series(' '.join(texts.description.astype(str)).split(' ')).value_counts())
If every value in the column is already a string, you can omit the conversion and it works just as well:
print(pd.Series(' '.join(texts.description).split(' ')).value_counts())
one 2
is 2
This 1
text 1
this 1
and 1
another 1
dtype: int64
Upvotes: 0
Reputation: 150815
Your code failed because str(texts.description) gives:
'0 This is one text\n1 and this is another one\nName: description, dtype: object'
That is the string representation of the Series, almost the same text that print(texts.description) displays. When you call ' '.join(str(texts.description)), join iterates over that string one character at a time, so you end up counting characters instead of words.
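To make the failure concrete, here is a minimal sketch using the DataFrame from the question. The variable names as_text, spaced and joined are only illustrative.

```python
import pandas as pd

texts = pd.DataFrame({"description": ["This is one text", "and this is another one"]})

# str() on a Series returns its printed representation as ONE string,
# including the index labels and the trailing "Name: ..." line
as_text = str(texts.description)
print(type(as_text))

# str.join iterates over its argument; iterating a string yields characters
spaced = " ".join("abc")
print(spaced)  # a b c

# Passing the Series itself makes join iterate over the row values instead
joined = " ".join(texts.description)
print(joined)  # This is one text and this is another one
```

Splitting joined on whitespace gives words, whereas splitting the space-separated characters of as_text gives single characters, which matches the output shown in the question.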
Try:
(texts.description
.str.lower()
.str.split(expand=True)
.stack().value_counts()
)
Output:
this 2
one 2
is 2
another 1
and 1
text 1
dtype: int64
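As a possible variant of the approach above, Series.explode (available in pandas 0.25 and later) flattens the per-row word lists without building a wide intermediate frame. This is only a sketch; the name "counts" is taken from the expected output in the question.

```python
import pandas as pd

texts = pd.DataFrame({"description": ["This is one text", "and this is another one"]})

counts = (
    texts["description"]
    .str.lower()      # treat "This" and "this" as the same word
    .str.split()      # one list of words per row, split on any whitespace
    .explode()        # one row per word
    .value_counts()   # frequency of each word, most frequent first
    .rename("counts") # name the resulting Series
)
print(counts)
```

The order of words that share the same count is not guaranteed, so the printed order may differ from the expected result in the question.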
Upvotes: 1
Reputation: 16916
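from collections import Counter
# Lowercase and split each description, then count the words across all rows
l = texts['description'].apply(lambda x: x.lower().split())
Counter([item for sublist in l for item in sublist])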
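Since the question asks for a Series, the Counter can be wrapped in one. The following is a sketch under that assumption; word_counts and counts are illustrative names, not part of the original answer.

```python
from collections import Counter

import pandas as pd

texts = pd.DataFrame({"description": ["This is one text", "and this is another one"]})

# Count lowercase words across all descriptions in a single pass
word_counts = Counter(
    word
    for description in texts["description"]
    for word in description.lower().split()
)
print(word_counts.most_common(3))

# A Counter is a dict subclass, so pd.Series accepts it directly
counts = pd.Series(word_counts, name="counts").sort_values(ascending=False)
print(counts)
```

This keeps the counting in plain Python and only uses pandas for the final presentation.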
Upvotes: 1
Reputation: 1300
Remove the str call from
print(pd.Series(' '.join(str(texts.description)).split(' ')).value_counts())
This is because str(texts.description)
returns
'0 This is one text\n1 and this is another one\nName: description, dtype: object'
which is the printed representation of the whole Series rather than the text in its rows.
It works like this:
print(pd.Series(' '.join(texts.description).split(' ')).value_counts())
And gives you:
is 2
one 2
This 1
and 1
this 1
another 1
text 1
1
dtype: int64
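The blank entry with a count of 1 in the output above is consistent with split(' ') meeting a doubled or trailing space, although the data that produced it is not shown, so that is an assumption. The sketch below uses a hypothetical string to show the difference between split(' ') and split() with no argument.

```python
# Hypothetical input: note the double space after "text" and the trailing space
text = "This is one text  and this is another one "

# split(" ") splits on every single space, so extra spaces yield empty strings
with_empties = text.split(" ")
print(with_empties)

# split() with no argument splits on runs of whitespace and drops empty strings
words = text.split()
print(words)
```

Using split() without an argument avoids counting empty strings as words.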
Upvotes: 0