Reputation: 159
I have a df that has one column with multiple comma-separated values in each row. I want to count how many times a unique value occurs in that column.
The df looks like this:
category country
0 widget1, widget2, widget3, widget4 USA
1 widget1, widget3 USA
2 widget1, widget2 China
3 widget2 Canada
4 widget1, widget2, widget3 China
5 widget2 Vietnam
6 widget3 Canada
7 widget1, widget3 USA
8 widget1, widget3 Japan
9 widget2 Germany
I want to know how many times each widget appears in the column "category". The results in this example would be:
widget1 = 6, widget2 = 6, widget3 = 6, widget4 = 1
I can use .value_counts():
df["category"].value_counts()
but that only counts rows whose entire string is identical; it doesn't count the individual values.
I could use value_counts and enter each value for it to count, but in the actual DataFrame there are too many rows and unique values in that column to make it practical.
Also, is there a way to not double count if a single row contains two values that are the same? For example, if there were a "widget1, black widget1, yellow widget1" in the same row, I'd just want to count that as one widget1.
Upvotes: 2
Views: 1505
Reputation: 1120
This might not be the most elegant solution, but it should work. Basically we need to split each entry in the "category" column and then count the individual words.
from itertools import chain
import pandas as pd

words = [i.split(',') for i in df['category'].tolist()]
words = [i.strip() for i in chain.from_iterable(words)]
pd.Series(words).value_counts()
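To also address the follow-up in the question (not double counting a value that repeats within one row), a small variation of the same idea should work: deduplicate each row with a set before flattening. This is a sketch, assuming the column name "category" and the sample data from the question:

```python
from itertools import chain
import pandas as pd

# sample data from the question (assumed column name "category")
df = pd.DataFrame({"category": [
    "widget1, widget2, widget3, widget4",
    "widget1, widget3",
    "widget1, widget2",
    "widget2",
    "widget1, widget2, widget3",
    "widget2",
    "widget3",
    "widget1, widget3",
    "widget1, widget3",
    "widget2",
]})

# split each row, strip whitespace, and use a set to drop duplicates within a row
rows = ({w.strip() for w in cell.split(",")} for cell in df["category"])
counts = pd.Series(list(chain.from_iterable(rows))).value_counts()
print(counts)

# a row that repeats a value is still counted only once
dedup = pd.Series(list({w.strip() for w in "widget1, widget1, widget2".split(",")})).value_counts()
print(dedup)
```

With the question's data the per-row sets happen to give the same counts as plain flattening, since no row repeats a value; the second example shows the deduplication at work.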
Upvotes: 0
Reputation: 42886
Another solution would be to unnest your string into rows, then use value_counts:
explode_str(df, 'category', ', ')['category'].value_counts()
widget2 6
widget1 6
widget3 6
widget4 1
Name: category, dtype: int64
Function used from linked answer:
import numpy as np

def explode_str(df, col, sep):
    s = df[col]
    # repeat each row's index once per value in that row
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    # duplicate the rows, then replace the column with the flattened split values
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})
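On pandas 0.25 or newer, the same unnesting can be done with the built-in Series.explode, which avoids the helper function entirely. A sketch, assuming the sample data from the question:

```python
import pandas as pd

# sample data from the question (assumed)
df = pd.DataFrame({"category": [
    "widget1, widget2, widget3, widget4",
    "widget1, widget3",
    "widget1, widget2",
    "widget2",
    "widget1, widget2, widget3",
    "widget2",
    "widget3",
    "widget1, widget3",
    "widget1, widget3",
    "widget2",
]})

# split into lists, explode to one value per row, strip stray spaces, then count
counts = df["category"].str.split(",").explode().str.strip().value_counts()
print(counts)
```

Splitting on "," and stripping afterwards is slightly more forgiving than splitting on ", " if the spacing after commas is inconsistent.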
Upvotes: 1
Reputation: 323226
Do it with get_dummies (note the ', ' separator, so the space after each comma is handled; requires import numpy as np):
df.category.str.get_dummies(', ').replace(0, np.nan).stack().sum(level=1)
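Because get_dummies builds a 0/1 indicator column per widget, a value repeated within a row is counted only once, which also answers the follow-up in the question. Summing the dummy columns directly is an equivalent, slightly simpler form; a sketch on the question's sample data (assumed):

```python
import pandas as pd

# sample data from the question (assumed)
df = pd.DataFrame({"category": [
    "widget1, widget2, widget3, widget4",
    "widget1, widget3",
    "widget1, widget2",
    "widget2",
    "widget1, widget2, widget3",
    "widget2",
    "widget3",
    "widget1, widget3",
    "widget1, widget3",
    "widget2",
]})

# one 0/1 indicator column per widget; summing the columns gives the counts
counts = df["category"].str.get_dummies(sep=", ").sum()
print(counts)

# a repeated value within a row still produces a single 1 in its indicator column
single = pd.Series(["widget1, widget1"]).str.get_dummies(sep=", ").sum()
print(single)
```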
Upvotes: 4