Reputation: 59
I have a data set:
Words | Count |
---|---|
Hello,World | |
World,%,Hello,Germany | |
Germany,100,ML,Germany | |
I would like the code to do the following:
("Hello,World") ---> ("Hello","World")
and then produce:
Words | Counts |
---|---|
Hello | 2 |
World | 2 |
% | 1 |
100 | 1 |
ML | 1 |
Germany | 3 |
The type of "CL1" is "object"
import pandas as pd
import re
separators = ","
def get_word_len(words: str) -> int:
    return len(re.split(separators, words))
df["Count"] = df.Words.apply(get_word_len)
print(df)
But this counts the number of words in each cell, not how often each word occurs across the whole column.
Upvotes: 2
Views: 560
Reputation: 697
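from collections import Counter
import pandas as pd
# Join all cells into one comma-separated string, then count every token.
data = ",".join(df["Words"].tolist())
counter = Counter(data.split(","))
# Build the frame from (word, count) pairs; pd.DataFrame(dict(counter)) raises for all-scalar values.
new_df = pd.DataFrame(list(counter.items()), columns=["Words", "Count"])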
Upvotes: 1
Reputation: 5745
You can use the string methods in pandas:
df['Words'].str.split(',').explode().value_counts()
output:
Hello 2
World 2
Germany 1
% 1
ML 1
100 1
Name: Words, dtype: int64
to make it into a dataframe:
pd.DataFrame(df['Words'].str.split(',').explode().value_counts()).reset_index().rename({'index':"Words","Words":"Count"},axis=1)
output:
Words Count
0 Hello 2
1 World 2
2 Germany 1
3 % 1
4 ML 1
5 100 1
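A note on the rename step: the column names that come out of value_counts() followed by reset_index() depend on the pandas version (as far as I know, pandas 2.0 names the result "count" and puts the original column name on the index), so the rename mapping above may mislabel the columns on newer versions. A minimal sketch that names both columns explicitly instead, assuming the question's last row contains Germany twice, as its expected output implies:

```python
import pandas as pd

# Sample data assumed from the question; the last row is taken to contain "Germany" twice.
df = pd.DataFrame(
    {"Words": ["Hello,World", "World,%,Hello,Germany", "Germany,100,ML,Germany"]}
)

counts_df = (
    df["Words"]
    .str.split(",")             # each cell becomes a list of words
    .explode()                  # one row per word
    .value_counts()             # occurrences of each word, most frequent first
    .rename_axis("Words")       # name the index explicitly
    .reset_index(name="Count")  # name the count column explicitly
)
print(counts_df)
```

Naming the axis and the count column directly avoids relying on whatever default names the installed pandas version picks.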
Upvotes: 2
Reputation: 15872
You can use collections.Counter for this:
>>> df
Words
0 Hello,World
1 World,%,Hello
2 Germany,100,ML
>>> pd.Series(Counter(','.join(df.Words).split(',')),
name='count').rename_axis(df.columns[0]).reset_index()
Words count
0 Hello 2
1 World 2
2 % 1
3 Germany 1
4 100 1
5 ML 1
Timing:
>>> %timeit pd.DataFrame(df['Words'].str.split(',').explode().value_counts()).reset_index().rename({'index':"Words","Words":"Count"},axis=1)
1.53 ms ± 30.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit pd.Series(Counter(','.join(df.Words).split(',')), name='count').rename_axis(df.columns[0]).reset_index()
873 µs ± 15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
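The timings above were taken on a three-row frame, where fixed overhead dominates, so it may be worth repeating the comparison on more data. A rough sketch using the standard timeit module; the input size (the sample rows repeated 10,000 times) and the helper names are my own assumptions, and the numbers will differ by machine and pandas version:

```python
import timeit
from collections import Counter

import pandas as pd

# Hypothetical larger input: the three sample rows repeated 10,000 times.
df = pd.DataFrame({"Words": ["Hello,World", "World,%,Hello", "Germany,100,ML"] * 10_000})


def with_value_counts():
    # pandas-only route: split, explode, count
    return df["Words"].str.split(",").explode().value_counts()


def with_counter():
    # plain-Python route: join all cells, split once, count with Counter
    return pd.Series(Counter(",".join(df["Words"]).split(",")), name="count")


# Both approaches should agree before their speed is compared.
assert with_value_counts().to_dict() == with_counter().to_dict()

runs = 20
for func in (with_value_counts, with_counter):
    seconds = timeit.timeit(func, number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e3:.2f} ms per call")
```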
Upvotes: 2
Reputation: 75110
The methods above work and are efficient.
Here is another way, using str.get_dummies with sum:
df['Words'].str.get_dummies(",").sum()
% 1
100 1
Germany 1
Hello 2
ML 1
World 2
dtype: int64
df['Words'].str.get_dummies(",").sum().rename_axis("Words").reset_index(name='Counts')
Words Counts
0 % 1
1 100 1
2 Germany 1
3 Hello 2
4 ML 1
5 World 2
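One caveat with this approach: str.get_dummies produces 0/1 indicator columns per row, so a word that appears more than once in the same cell is counted only once for that row. The sketch below compares it with explode plus value_counts, assuming the question's last row contains Germany twice, as its expected output implies:

```python
import pandas as pd

# Sample data assumed from the question; the last row is taken to contain "Germany" twice.
df = pd.DataFrame(
    {"Words": ["Hello,World", "World,%,Hello,Germany", "Germany,100,ML,Germany"]}
)

# Indicator columns: each row contributes at most 1 per word, so this counts rows containing the word.
rows_containing = df["Words"].str.get_dummies(",").sum()

# One row per word: every occurrence is counted, including repeats within a cell.
occurrences = df["Words"].str.split(",").explode().value_counts()

print(rows_containing["Germany"], occurrences["Germany"])
```

If repeats within a single cell should count, as in the question's expected output, the explode or Counter approaches are the safer choice; get_dummies is fine when each word appears at most once per cell.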
Upvotes: 3