Reputation: 449
Each row of my dataframe has a list of strings. I want to count the number of unique strings across the whole column. My current method is slow:
words
0 we like to party
1 can can dance
2 yes we can
...
df["words"].apply(lambda x: len(np.unique(x, return_counts=True)[1]))
Wanted output: 7
It also doesn't account for a word occurring in two or more rows, and handling that would make it even slower. Is there a fast way to do this? Thanks!
Upvotes: 1
Views: 286
Reputation: 862511
I think you need the length of the set created by joining the rows and splitting the result into words:
a = len(set(' '.join(df['words']).split()))
print (a)
7
If the column contains lists, use a set comprehension (thank you @juanpa.arrivillaga):
print (df)
words
0 [we, like, to, party]
1 [can, can, dance]
2 [yes, we, can]
a = len({y for x in df['words'] for y in x})
print (a)
7
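To check both variants on the sample data, here is a minimal self-contained sketch. The names `df_str`, `df_list`, `count_from_strings` and `count_from_lists` are only illustrative and are not from the question:

```python
import pandas as pd

# Sample data from the question, once as plain strings and once as lists of words.
df_str = pd.DataFrame({"words": ["we like to party", "can can dance", "yes we can"]})
df_list = pd.DataFrame({"words": [s.split() for s in df_str["words"]]})

# String column: join all rows into one string, split on whitespace, deduplicate.
count_from_strings = len(set(" ".join(df_str["words"]).split()))

# List column: flatten every list into a single set.
count_from_lists = len({word for row in df_list["words"] for word in row})

print(count_from_strings, count_from_lists)
```

Both variants build a single set over the whole column, so a word that appears in several rows is counted only once.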
Upvotes: 2
Reputation: 1122
You can also use, for example, the following variant:
from itertools import chain
from operator import methodcaller
import pandas as pd
df = pd.DataFrame({
    "words": [
        "we like to party",
        "can can dance",
        "yes we can"
    ]
})
print(len(set(
    chain.from_iterable(
        map(methodcaller("split", " "), df.words.values)
    )
)))
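A pandas-only alternative might look like the sketch below. It assumes pandas 0.25 or newer (for `Series.explode`) and a column of plain strings; `unique_count` is just an illustrative name. I have not timed it against the set-based versions, so treat it as an option to benchmark rather than a guaranteed speedup:

```python
import pandas as pd

df = pd.DataFrame({
    "words": [
        "we like to party",
        "can can dance",
        "yes we can"
    ]
})

# Split each row on whitespace, put every word on its own row,
# then count the distinct values across the whole column.
unique_count = df["words"].str.split().explode().nunique()
print(unique_count)
```

Note that `str.split()` with no argument splits on any run of whitespace, whereas `split(" ")` yields empty strings when a row contains consecutive spaces.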
Upvotes: 2