Jsmoka

Reputation: 59

Split comma-separated words in a column and count how often each word occurs

I have a data set:

Words Count
Hello,World
World,%,Hello,Germany
Germany,100,ML,Germany

My Goal:

I would like the code to produce:

Words Counts
Hello 2
World 2
% 1
100 1
ML 1
Germany 3

What I did:

The type of "CL1" is "object"

import pandas as pd
import re

separators = ","

def get_word_len(words: str) -> int:
   return len(re.split(separators, words))

df["Count"] = df.Words.apply(get_word_len)

print(df)

But this counts the number of words in each cell, NOT how often each word occurs across the whole column.

Upvotes: 2

Views: 560

Answers (4)

AmineBTG

Reputation: 697

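from collections import Counter
import pandas as pd

# Join all cells into one comma-separated string, then count each token
data = ",".join(df["Words"].tolist())

counter = Counter(data.split(","))

# pd.DataFrame(dict(counter)) raises a ValueError on all-scalar values,
# so build the two columns from the (word, count) pairs instead
new_df = pd.DataFrame(list(counter.items()), columns=["Words", "Count"])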
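
The same counting idea can be sketched with the standard library only, without joining everything into one large string first. This is a minimal sketch: the `rows` list is a hypothetical stand-in for `df["Words"].tolist()`, filled with the question's sample data and assuming the last token is meant to be "Germany".

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for df["Words"].tolist(), using the question's data
rows = ["Hello,World", "World,%,Hello,Germany", "Germany,100,ML,Germany"]

# Split each cell on commas and feed every token into a single Counter
counter = Counter(chain.from_iterable(row.split(",") for row in rows))

# most_common() lists the (word, count) pairs from most to least frequent
for word, count in counter.most_common():
    print(word, count)
```

Because every token is counted individually, a word repeated inside one cell is counted each time it appears.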

Upvotes: 1

adir abargil

Reputation: 5745

You can use the pandas string methods:

df['Words'].str.split(',').explode().value_counts()

output:

Hello      2
World      2
Germany    1
%          1
ML         1
100        1
Name: Words, dtype: int64

To turn it into a DataFrame:

pd.DataFrame(df['Words'].str.split(',').explode().value_counts()).reset_index().rename({'index':"Words","Words":"Count"},axis=1)

output:

    Words   Count
0   Hello   2
1   World   2
2   Germany 1
3   %       1
4   ML      1
5   100     1
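
The rename mapping above relies on how `value_counts` names its result, and that naming changed in pandas 2.0 (the result is named "count" and the index takes the original column's name). A sketch that sets both names explicitly should behave the same on either side of that change. The sample frame below uses the question's data and assumes the last token is "Germany"; the column names "Words" and "Count" are just the ones used in this thread.

```python
import pandas as pd

# Sample data from the question (assuming the last token is "Germany")
df = pd.DataFrame({"Words": ["Hello,World",
                             "World,%,Hello,Germany",
                             "Germany,100,ML,Germany"]})

counts = (
    df["Words"]
    .str.split(",")             # each cell becomes a list of tokens
    .explode()                  # one token per row
    .value_counts()             # frequency of each distinct token
    .rename_axis("Words")       # name the index explicitly
    .reset_index(name="Count")  # index becomes a column, counts go to "Count"
)
print(counts)
```

Naming the index and the value column directly avoids depending on the default names that `value_counts` assigns.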

Upvotes: 2

Sayandip Dutta

Reputation: 15872

You can use collections.Counter for this:

>>> df
            Words
0     Hello,World
1   World,%,Hello
2  Germany,100,ML

>>> from collections import Counter
>>> pd.Series(Counter(','.join(df.Words).split(',')),
              name='count').rename_axis(df.columns[0]).reset_index()

     Words  count
0    Hello      2
1    World      2
2        %      1
3  Germany      1
4      100      1
5       ML      1

Timing:

>>> %timeit pd.DataFrame(df['Words'].str.split(',').explode().value_counts()).reset_index().rename({'index':"Words","Words":"Count"},axis=1)
1.53 ms ± 30.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit pd.Series(Counter(','.join(df.Words).split(',')), name='count').rename_axis(df.columns[0]).reset_index()
873 µs ± 15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Upvotes: 2

anky

Reputation: 75110

The methods above work and are efficient.

Here is another way, using str.get_dummies with sum:

df['Words'].str.get_dummies(",").sum()

%          1
100        1
Germany    1
Hello      2
ML         1
World      2
dtype: int64

df['Words'].str.get_dummies(",").sum().rename_axis("Words").reset_index(name='Counts')

     Words  Counts
0        %       1
1      100       1
2  Germany       1
3    Hello       2
4       ML       1
5    World       2
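
One caveat worth checking: `str.get_dummies` produces 0/1 indicator columns per row, so a word repeated inside a single cell adds only 1 to the sum. The sketch below compares it with the `explode` approach on the question's data, assuming the last token is "Germany"; the variable names are illustrative only.

```python
import pandas as pd

# Sample data from the question (assuming the last token is "Germany")
df = pd.DataFrame({"Words": ["Hello,World",
                             "World,%,Hello,Germany",
                             "Germany,100,ML,Germany"]})

# get_dummies marks presence per row (0 or 1), so the repeated "Germany"
# in the last row contributes only 1 to the column sum
by_presence = df["Words"].str.get_dummies(",").sum()

# explode keeps every token, so repeats inside a cell are all counted
by_occurrence = df["Words"].str.split(",").explode().value_counts()

print(by_presence["Germany"])    # number of rows containing "Germany"
print(by_occurrence["Germany"])  # total occurrences of "Germany"
```

If the goal is the total number of occurrences, as in the question's expected output, the `explode` or `Counter` approaches are the safer choice. `get_dummies` answers the slightly different question of how many rows contain each word.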

Upvotes: 3
