Reputation: 4055
I have a DataFrame in which multiple columns have one or comma separated value as well as some with nan values.
I would like to: 0. ignore nan
them dropping duplicated values as a single column with comma seperation.
df = pd.DataFrame({'c1' : ['a' , 'aa' , np.nan, np.nan, 'a, aa', 'abc' , 'acd' ],
'c2' : ['ab' , 'ab, abc', 'ab' , np.nan, np.nan , 'ab, a', 'ab, a'],
'c3' : ['abc', np.nan , 'a' , np.nan, 'abc' , np.nan, np.nan],
'c4' : ['abc', 'ab' , 'a' , np.nan, 'ab' , np.nan, np.nan]})
df.foo()
# Target results
0
0 a, ab, abc
1 aa, ab, abc
2 a, ab
3
4 a, aa, ab, abc
5 a, ab, abc
6 a, ab, abd
Upvotes: 1
Views: 1205
Reputation: 13349
Try:
df['new'] = df.stack().groupby(level=0).apply(lambda x: ','.join(np.unique(x.tolist())))
df['new']
:
0 a,ab,abc
1 aa,ab,ab, abc
2 a,ab
3 NaN
4 a, aa,ab,abc
5 ab, a,abc
6 ab, a,acd
Name: new, dtype: object
correction:
df['new'] = df.stack().groupby(level=0).apply(lambda x: ', '.join(np.unique(', '.join(x.tolist()).split(', '))))
df['new']
:
0 a, ab, abc
1 aa, ab, abc
2 a, ab
4 a, aa, ab, abc
5 a, ab, abc
6 a, ab, acd
dtype: object
Or you can use:
df.stack().str.split(", ").groupby(level=0).apply(lambda x: ', '.join(np.unique(np.concatenate(x.tolist()))))
Upvotes: 4
Reputation: 294258
def f(x):
x = ','.join(filter(pd.notna, x))
return ', '.join(sorted(set(map(str.strip, x.split(',')))))
pd.Series([*map(f, zip(*map(df.get, df)))])
0 a, ab, abc
1 aa, ab, abc
2 a, ab
3
4 a, aa, ab, abc
5 a, ab, abc
6 a, ab, acd
dtype: object
Upvotes: 4
Reputation: 13821
I was able to use a similar approach. sorted
will sort the items in your list of values, and set
will keep only unique values. Thereafter, a merge
with a drop
will get you a dataframe same as your desired output.
one = pd.DataFrame(df.stack().groupby(level=0).apply(lambda x: ','.join(sorted(set(x.tolist())))))
res = (pd.merge(df,one,left_index=True,right_index=True,how='left')).drop(['c1','c2','c3','c4'],axis=1)
res
Out[187]:
0
0 a,ab,abc
1 aa,ab,ab, abc
2 a,ab
3 NaN
4 a, aa,ab,abc
5 ab, a,abc
6 ab, a,acd
Upvotes: 2
Reputation: 150735
You can try your logic here:
(df.stack() # stack removes `nan`
.str.split(', ') # split by `', '`
.explode() # separate them
.groupby(level=0)
.apply(lambda x: ','.join(x.sort_values().unique()))
.reindex(df.index, fill_value='')
)
Output:
0 a, ab, abc
1 aa, ab, abc
2 a, ab
3
4 a, aa, ab, abc
5 a, ab, abc
6 a, ab, acd
dtype: object
Upvotes: 4