
Reputation: 59

Separating words and counting their frequency across a column

I have a data set:

Words	Count
Hello,World
World,%,Hello,Germany
Germany,100,ML,Germany

My Goal:

I would like the code to:

Separate the words: ("Hello,World") ---> ("Hello","World")
List all separated words one after another in a new column
Count the frequency of each word and put the result in "Count", e.g. it finds the word "Hello" twice in column "Words"

Words	Counts
Hello	2
World	2
%	1
100	1
ML	1
Germany	3

What I did:

The type of "CL1" is "object"

import pandas as pd
import re

separators = ","

def get_word_len(words: str) -> int:
   return len(re.split(separators, words))

df["Count"] = df.Words.apply(get_word_len)

print(df)

But this counts the number of words in each cell, NOT how often each word occurs across the whole column.

Upvotes: 2

Answers (4)

AmineBTG

Reputation: 697

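from collections import Counter

import pandas as pd

# Join all cells into one comma-separated string, then split it into words
data = ",".join(df["Words"].tolist())

counter = Counter(data.split(","))

# One row per word; pd.DataFrame(dict(counter)) fails because all values are scalars
new_df = pd.DataFrame(list(counter.items()), columns=["Words", "Count"])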
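
For a runnable end-to-end version of the Counter approach, here is a minimal sketch. The sample frame is an assumption modelled on the question's data (with "Germnay" taken to be a typo for "Germany"), and the column names "Words" and "Count" are chosen to match the desired output.

```python
from collections import Counter

import pandas as pd

# Assumed sample data, modelled on the question ("Germnay" spelled as "Germany").
df = pd.DataFrame(
    {"Words": ["Hello,World", "World,%,Hello,Germany", "Germany,100,ML,Germany"]}
)

# Split every cell on commas and count each word across the whole column.
counter = Counter(word for cell in df["Words"] for word in cell.split(","))

# One row per distinct word, in order of first appearance.
new_df = pd.DataFrame(list(counter.items()), columns=["Words", "Count"])
print(new_df)
```

With this input, "Germany" is counted three times, which matches the table in the question.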

Upvotes: 1

adir abargil

Reputation: 5745

You can use the string accessor (.str) in pandas:

df['Words'].str.split(',').explode().value_counts()

output:

Hello      2
World      2
Germany    1
%          1
ML         1
100        1
Name: Words, dtype: int64

To turn it into a DataFrame:

pd.DataFrame(df['Words'].str.split(',').explode().value_counts()).reset_index().rename({'index':"Words","Words":"Count"},axis=1)

output:

    Words   Count
0   Hello   2
1   World   2
2   Germany 1
3   %       1
4   ML      1
5   100     1
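
One caveat: the names that value_counts() gives its result differ between pandas versions (since pandas 2.0 the Series is named "count" and its index takes the original column's name), so the rename mapping above may not produce "Words"/"Count" everywhere. A sketch that sets both names explicitly, assuming the same three-row sample frame as in the answers:

```python
import pandas as pd

# Assumed sample frame, same rows as used in the answers.
df = pd.DataFrame({"Words": ["Hello,World", "World,%,Hello", "Germany,100,ML"]})

counts = (
    df["Words"]
    .str.split(",")             # each cell becomes a list of words
    .explode()                  # one word per row
    .value_counts()             # frequency of each distinct word
    .rename_axis("Words")       # name the index explicitly
    .reset_index(name="Count")  # index -> "Words" column, values -> "Count"
)
print(counts)
```

Because rename_axis and reset_index(name=...) do not depend on the default names, the result should have the columns "Words" and "Count" on both older and newer pandas releases.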

Upvotes: 2

Sayandip Dutta

Reputation: 15872

You can use collections.Counter for this:

>>> df
            Words
0     Hello,World
1   World,%,Hello
2  Germany,100,ML

>>> from collections import Counter
>>> pd.Series(Counter(','.join(df.Words).split(',')),
...           name='count').rename_axis(df.columns[0]).reset_index()

     Words  count
0    Hello      2
1    World      2
2        %      1
3  Germany      1
4      100      1
5       ML      1

Timing:

>>> %timeit pd.DataFrame(df['Words'].str.split(',').explode().value_counts()).reset_index().rename({'index':"Words","Words":"Count"},axis=1)
1.53 ms ± 30.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit pd.Series(Counter(','.join(df.Words).split(',')), name='count').rename_axis(df.columns[0]).reset_index()
873 µs ± 15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
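
These timings were taken on a three-row frame, so they mostly reflect fixed overhead. To compare the two approaches outside IPython and on more data, something like the following sketch could be used. The 1,000-fold repetition and the helper names with_explode and with_counter are hypothetical choices for illustration, and no particular result is implied.

```python
import timeit
from collections import Counter

import pandas as pd

# Hypothetical larger input: the three sample rows repeated 1,000 times.
df = pd.DataFrame({"Words": ["Hello,World", "World,%,Hello", "Germany,100,ML"] * 1000})


def with_explode():
    # pandas-only approach: split, explode, count
    return df["Words"].str.split(",").explode().value_counts()


def with_counter():
    # collections.Counter approach: join, split, count
    return pd.Series(Counter(",".join(df["Words"]).split(",")), name="count")


# Both approaches should agree on the counts before being compared for speed.
assert with_explode().to_dict() == with_counter().to_dict()

for fn in (with_explode, with_counter):
    seconds = timeit.timeit(fn, number=20)
    print(f"{fn.__name__}: {seconds / 20 * 1e3:.2f} ms per call")
```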

Upvotes: 2

anky

Reputation: 75110

The methods above work and are efficient.

Here is another way, using str.get_dummies with sum:

df['Words'].str.get_dummies(",").sum()

%          1
100        1
Germany    1
Hello      2
ML         1
World      2
dtype: int64

df['Words'].str.get_dummies(",").sum().rename_axis("Words").reset_index(name='Counts')

     Words  Counts
0        %       1
1      100       1
2  Germany       1
3    Hello       2
4       ML       1
5    World       2
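
Note that str.get_dummies produces 0/1 indicator columns per row, so a word that appears more than once in the same cell is counted only once for that row. A small sketch of the difference, using an assumed two-row sample in which "Germany" is repeated within one cell:

```python
import pandas as pd

# Assumed sample in which "Germany" occurs twice within a single cell.
df = pd.DataFrame({"Words": ["Hello,World", "Germany,100,ML,Germany"]})

# get_dummies marks presence per row (0 or 1), so the repeat is counted once.
per_row = df["Words"].str.get_dummies(",").sum()

# explode keeps every occurrence, so the repeat is counted twice.
per_occurrence = df["Words"].str.split(",").explode().value_counts()

print(per_row["Germany"])         # 1
print(per_occurrence["Germany"])  # 2
```

If the goal is the total number of occurrences rather than the number of rows containing a word, the explode or Counter variants are the safer choice.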

Upvotes: 3
