Francis Smart

Reputation: 4055

Apply split join Pandas DataFrame

I have a DataFrame in which each column holds either a single value, a comma-separated string of values, or NaN.

For each row I would like to:

  0. ignore NaN
  1. split each value on ", "
  2. keep only the unique values
  3. sort them alphabetically
  4. join what remains with ", " as the separator

so that the result is a single column (one Series) of comma-separated, de-duplicated values. A minimal sketch of what the placeholder df.foo() below should do follows the target results.

df = pd.DataFrame({'c1' : ['a'  , 'aa'     , np.nan, np.nan, 'a, aa', 'abc'  , 'acd'  ],
                   'c2' : ['ab' , 'ab, abc', 'ab'  , np.nan, np.nan , 'ab, a', 'ab, a'],
                   'c3' : ['abc', np.nan   , 'a'   , np.nan, 'abc'  ,  np.nan, np.nan],
                   'c4' : ['abc', 'ab'     , 'a'   , np.nan, 'ab'  ,  np.nan, np.nan]})

df.foo()  # placeholder for the operation I'm looking for
# Target results
    0
0   a, ab, abc
1   aa, ab, abc
2   a, ab 
3   
4   a, aa, ab, abc
5   a, ab, abc
6   a, ab, acd
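
For reference, a minimal sketch of those steps done row by row (combine_row is just an illustrative name; I'm looking for a neater pandas way to do this):

def combine_row(row):
    # 0. ignore NaN; 1. split each cell on ", "
    parts = []
    for cell in row.dropna():
        parts.extend(str(cell).split(', '))
    # 2. keep unique values, 3. sort, 4. rejoin with ", " (all-NaN rows give '')
    return ', '.join(sorted(set(parts)))

df.apply(combine_row, axis=1)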

Upvotes: 1

Views: 1205

Answers (4)

Pygirl

Reputation: 13349

Try:

df['new'] = df.stack().groupby(level=0).apply(lambda x: ','.join(np.unique(x.tolist())))

df['new']:

0         a,ab,abc
1    aa,ab,ab, abc
2             a,ab
3              NaN
4     a, aa,ab,abc
5        ab, a,abc
6        ab, a,acd
Name: new, dtype: object

Correction (split on ', ' so combined cells are handled):

df['new'] = df.stack().groupby(level=0).apply(lambda x: ', '.join(np.unique(', '.join(x.tolist()).split(', '))))

df['new']:

0        a, ab, abc
1       aa, ab, abc
2             a, ab
4    a, aa, ab, abc
5        a, ab, abc
6        a, ab, acd
dtype: object

Or you can use:

df.stack().str.split(", ").groupby(level=0).apply(lambda x: ', '.join(np.unique(np.concatenate(x.tolist()))))
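
stack() drops rows that are entirely NaN, so index 3 is missing from these results. If you want it back as an empty string, one option (the reindex idea that Quang Hoang's answer below also uses; out is just an illustrative name) is:

out = (df.stack()
         .str.split(', ')
         .groupby(level=0)
         .apply(lambda x: ', '.join(np.unique(np.concatenate(x.tolist()))))
         .reindex(df.index, fill_value=''))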

Upvotes: 4

piRSquared

Reputation: 294258

This will work:

def f(x):
    # drop NaN, then glue the row's cells into one comma-separated string
    x = ','.join(filter(pd.notna, x))
    # split it back apart, strip spaces, de-duplicate, sort, rejoin with ', '
    return ', '.join(sorted(set(map(str.strip, x.split(',')))))

# zip(*map(df.get, df)) yields one tuple of cell values per row
pd.Series([*map(f, zip(*map(df.get, df)))])

0        a, ab, abc
1       aa, ab, abc
2             a, ab
3                  
4    a, aa, ab, abc
5        a, ab, abc
6        a, ab, acd
dtype: object

Upvotes: 4

sophocles

Reputation: 13821

I was able to use a similar approach: sorted orders the values in each row and set keeps only the unique ones. A merge followed by a drop then leaves a single-column DataFrame. Note that this does not split cells that already contain ', ' (e.g. 'ab, abc' stays together) and the all-NaN row comes back as NaN; see the tweak after the output below.

one = pd.DataFrame(df.stack()
                     .groupby(level=0)
                     .apply(lambda x: ','.join(sorted(set(x.tolist())))))
res = (pd.merge(df, one, left_index=True, right_index=True, how='left')
         .drop(['c1','c2','c3','c4'], axis=1))

res
Out[187]: 
               0
0       a,ab,abc
1  aa,ab,ab, abc
2           a,ab
3            NaN
4   a, aa,ab,abc
5      ab, a,abc
6      ab, a,acd
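
If cells that already contain ', ' (such as 'ab, abc') should also be split apart, as step 1 of the question asks, one possible adjustment is to split and explode before de-duplicating (a sketch along the same lines, not a drop-in from the answer above):

one = pd.DataFrame(df.stack()
                     .str.split(', ')
                     .explode()
                     .groupby(level=0)
                     .apply(lambda x: ', '.join(sorted(set(x.tolist())))))
res = (pd.merge(df, one, left_index=True, right_index=True, how='left')
         .drop(['c1','c2','c3','c4'], axis=1)
         .fillna(''))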

Upvotes: 2

Quang Hoang

Reputation: 150735

You can follow your logic step by step:

(df.stack()            # stack removes `nan`
   .str.split(', ')    # split by `', '`
   .explode()          # one value per row
   .groupby(level=0)
   .apply(lambda x: ', '.join(x.sort_values().unique()))
   .reindex(df.index, fill_value='')
)

Output:

0        a, ab, abc
1       aa, ab, abc
2             a, ab
3                  
4    a, aa, ab, abc
5        a, ab, abc
6        a, ab, acd
dtype: object

Upvotes: 4
