anInputName

Reputation: 449

Efficient way to count unique elements in a column of lists?

Each row of my dataframe has a list of strings. I want to count the number of unique strings across the whole column. My current method is slow:

              words
0  we like to party
1  can can dance
2  yes we can
...

df["words"].apply(lambda x: len(np.unique(x, return_counts=True)[1]))

Wanted output: 7

It also doesn't check whether a word occurs in two or more rows, and adding that check would make it even slower. Is there a fast way to do this? Thanks!

Upvotes: 1

Views: 286

Answers (2)

jezrael

Reputation: 862511

I think you need the length of the set created by joining the rows and splitting the result into words:

a = len(set(' '.join(df['words']).split()))
print (a)
7

If the column contains lists, use a set comprehension (thank you, @juanpa.arrivillaga):

print (df)
                   words
0  [we, like, to, party]
1      [can, can, dance]
2         [yes, we, can]


a = len({y for x in df['words'] for y in x})
print (a)
7
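
A pandas-only alternative for the list case, as a minimal sketch: it assumes pandas 0.25 or newer (where `Series.explode` is available) and rebuilds the sample data shown above. Whether it beats the set comprehension depends on the data, so it is worth timing both on your real column.

```python
import pandas as pd

# Same sample data as above, with each row holding a list of strings.
df = pd.DataFrame({
    "words": [
        ["we", "like", "to", "party"],
        ["can", "can", "dance"],
        ["yes", "we", "can"],
    ]
})

# explode() gives every list element its own row;
# nunique() then counts the distinct values across the whole column.
n_unique = df["words"].explode().nunique()
print(n_unique)
```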

Upvotes: 2

dukkee

Reputation: 1122

For example, you can use the following variant:

from itertools import chain
from operator import methodcaller

import pandas as pd

df = pd.DataFrame({
    "words": [
        "we like to party",
        "can can dance",
        "yes we can"
    ]
})

print(len(set(
    chain.from_iterable(
        map(methodcaller("split", " "), df.words.values)
    )
)))
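
One caveat about the `split(" ")` call above: with an explicit separator, consecutive spaces produce empty strings, which would then be counted as a word. The sketch below illustrates this on hypothetical rows (the double space in the first row is an assumption added for the demonstration, not part of the question's data). Calling `split()` with no argument splits on any run of whitespace instead.

```python
from itertools import chain

# Hypothetical rows; the first one contains a double space on purpose.
rows = ["we like  to party", "can can dance", "yes we can"]

# split(" ") keeps an empty string between the two adjacent spaces.
with_sep = set(chain.from_iterable(r.split(" ") for r in rows))

# split() with no argument collapses runs of whitespace.
no_sep = set(chain.from_iterable(r.split() for r in rows))

print(len(with_sep))  # 8, because '' is counted as a word
print(len(no_sep))    # 7
```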

Upvotes: 2
