Reputation: 195
I have a sample DataFrame with a column a containing duplicated values like below:
a
0 1089, 1089, 1089
1 10A3, 10A3
2 10A3, 10A4, 10A4
3 TEL, TV
4 EZ, EZ
5 ABC Co., ABC Co.
I would like to remove the duplicates within each cell and count the remaining values:
a count
0 1089 1
1 10A3 1
2 10A3, 10A4 2
3 TEL, TV 2
4 EZ 1
5 ABC Co. 1
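For reference, the sample frame above can be reproduced with something like this (a minimal sketch; the column name a comes from the example):
import pandas as pd

df = pd.DataFrame({'a': ['1089, 1089, 1089', '10A3, 10A3', '10A3, 10A4, 10A4',
                         'TEL, TV', 'EZ, EZ', 'ABC Co., ABC Co.']})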
Upvotes: 2
Views: 75
Reputation: 674
Try this
def f(x):
    # split the cell, strip whitespace and keep the first occurrence of each value
    d = {}
    for key in x.split(','):
        key = key.strip()
        if key not in d:
            d[key] = 0
        d[key] += 1
    return ', '.join(d.keys())

df['a_new'] = df['a'].apply(f)
print(df)
df['count'] = df['a_new'].apply(lambda x: len(x.split(', ')))
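A shorter variant of the same idea (my sketch, not part of the answer) uses dict.fromkeys, which de-duplicates while preserving order:
def dedupe(cell):
    # keep the first occurrence of each value, in order
    parts = dict.fromkeys(p.strip() for p in cell.split(','))
    return ', '.join(parts)

df['a_new'] = df['a'].apply(dedupe)
df['count'] = df['a_new'].str.split(', ').str.len()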
Upvotes: 0
Reputation: 59274
Use str.get_dummies and sum across axis=1:
df['count'] = df.a.str.get_dummies(sep=', ').sum(1)
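For reference, str.get_dummies builds one 0/1 indicator column per distinct value, so summing across axis=1 counts the distinct values in each row. With the sample data above, the intermediate frame would look roughly like this (shown as comments, columns sorted alphabetically):
dummies = df.a.str.get_dummies(sep=', ')
#    1089  10A3  10A4  ABC Co.  EZ  TEL  TV
# 0     1     0     0        0   0    0   0
# 1     0     1     0        0   0    0   0
# 2     0     1     1        0   0    0   0
# 3     0     0     0        0   0    1   1
# 4     0     0     0        0   1    0   0
# 5     0     0     0        1   0    0   0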
To remove the duplicates, use explode
s = df.assign(a=df.a.str.split(', ')).explode('a').drop_duplicates()
a count
0 1089 1
1 10A3 1
2 10A3 2
2 10A4 2
3 TEL 2
3 TV 2
4 EZ 1
5 ABC Co. 1
If you really need everything back in the same row...
s.groupby(s.index).agg({'a': ', '.join, 'count': 'first'})
a count
0 1089 1
1 10A3 1
2 10A3, 10A4 2
3 TEL, TV 2
4 EZ 1
5 ABC Co. 1
Or simply use @WeNYoBen's ingenious solution ;)
s=df.a.str.get_dummies(sep=', ')
df['a']=s.dot(s.columns+',').str[:-1]
df['count']=s.sum(1)
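(A side note, not in the original answer) The dot trick works because multiplying a 0/1 indicator by a string either keeps or drops that column name, and the matrix product then concatenates the surviving names; the trailing separator is removed with .str[:-1]. A tiny standalone sketch:
m = pd.DataFrame({'x': [1, 0], 'y': [1, 1]})
m.dot(m.columns + ',').str[:-1]
# 0    x,y
# 1      y
# dtype: object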
Upvotes: 4
Reputation: 8631
You need to define your own method and apply it to your DataFrame.
def list_count(x):
    # split on ',', strip whitespace, drop duplicates while keeping order
    l = pd.Series(x.split(',')).str.strip().drop_duplicates().tolist()
    return pd.Series([', '.join(l), len(l)])

df['a'].apply(list_count).rename(columns={0: 'a', 1: 'count'})
Output:
a count
0 1089 1
1 10A3 1
2 10A3, 10A4 2
3 TEL, TV 2
4 EZ 1
5 ABC Co. 1
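If you want the result written back instead of only displayed (my addition, not part of the answer), the same expression can simply be assigned:
df = df['a'].apply(list_count).rename(columns={0: 'a', 1: 'count'})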
Upvotes: 2