sequence_hard

Reputation: 5385

How to compare column values of pandas groupby object and summarize them in a new column row

I have the following problem: I want to create a column in a dataframe that summarizes all values in a row. Then I want to compare the rows of that column to create a single row containing all the values from all columns, with each value present only once. As an example, I have the following data frame:

    df1:

  Column1 Column2
0    a     1,2,3
1    a     1,4,5
2    b     7,1,5
3    c     8,9
4    b     7,3,5

The desired output would be:

df1_new:

  Column1 Column2
0    a    1,2,3,4,5
1    b    1,3,5,7
2    c    8,9

What I am currently trying is result = df1.groupby('Column1'), but I don't know how to compare the values in the rows of the grouped objects, write them to the new column, and remove the duplicates. I read through the pandas documentation on Group By: split-apply-combine but could not figure out a way to do it. I also wonder whether, once I have my desired output, there is a way to check in how many rows of the grouped object each value in Column2 of df1_new appeared. Any help on this would be greatly appreciated!

Upvotes: 3

Views: 2426

Answers (2)

Sergey Bushmanov

Reputation: 25189

What about this:

df1
      Column1 Column2
    0       a   1,2,3
    1       a   1,4,5
    2       b   7,1,5
    3       c     8,9
    4       b   7,3,5

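import numpy as np

# Join each group's strings, split them into single values, then keep the sorted unique ones
df1.groupby('Column1').\
agg(lambda x: ','.join(x).split(','))['Column2'].\
apply(lambda x: ','.join(np.unique(x))).reset_index()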

  Column1    Column2
0       a  1,2,3,4,5
1       b    1,3,5,7
2       c        8,9
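
One caveat: np.unique sorts the values as strings, so a value like 10 would end up before 2. Below is a minimal sketch of a numeric sort, assuming every value in Column2 is an integer. The helper name merge_unique is made up for illustration and is not part of pandas.

```python
import pandas as pd

# Sample data from the question
df1 = pd.DataFrame({
    "Column1": ["a", "a", "b", "c", "b"],
    "Column2": ["1,2,3", "1,4,5", "7,1,5", "8,9", "7,3,5"],
})

def merge_unique(series):
    # Hypothetical helper: collect every value of the group into a set
    values = {v.strip() for cell in series for v in cell.split(",")}
    # Sort numerically (assumes all values are integers), then join back
    return ",".join(sorted(values, key=int))

df1_new = df1.groupby("Column1")["Column2"].agg(merge_unique).reset_index()
print(df1_new)
```

With the sample data this gives the same result as above; the difference only shows up once multi-digit values are mixed in.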

Upvotes: 1

Anand S Kumar

Reputation: 90899

A method by which you can do this would be to apply a function on the grouped DataFrame.

The function first converts the series for each group to a list and splits each string on ','. It then chains the pieces into a single iterable using itertools.chain.from_iterable, converts that to a set so that only unique values are left, sorts it, and converts it back to a string using str.join. Example -

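from itertools import chain

def applyfunc(x):
    # Split every string in the group on ',' and flatten the pieces into one iterable
    ch = chain.from_iterable(y.split(',') for y in x.tolist())
    # set() drops duplicates; sorted() orders the values (as strings)
    return ','.join(sorted(set(ch)))

df1_new = df1.groupby('Column1')['Column2'].apply(applyfunc).reset_index()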

Demo -

In [46]: df
Out[46]:
  Column1 Column2
0       a   1,2,3
1       a   1,4,5
2       b   7,1,5
3       c     8,9
4       b   7,3,5

In [47]: from itertools import chain

In [48]: def applyfunc(x):
   ....:         ch = chain.from_iterable(y.split(',') for y in x.tolist())
   ....:         return ','.join(sorted(set(ch)))
   ....:

In [49]: df.groupby('Column1')['Column2'].apply(applyfunc).reset_index()
Out[49]:
  Column1    Column2
0       a  1,2,3,4,5
1       b    1,3,5,7
2       c        8,9
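
The question also asks in how many rows of each group a value appeared, which the code above does not cover. Here is a minimal sketch of one way to do it, assuming pandas 0.25 or later (for DataFrame.explode) and the sample df1 from the question. The names exploded, counts, row and n_rows are made up for illustration.

```python
import pandas as pd

# Sample data from the question
df1 = pd.DataFrame({
    "Column1": ["a", "a", "b", "c", "b"],
    "Column2": ["1,2,3", "1,4,5", "7,1,5", "8,9", "7,3,5"],
})

exploded = (
    df1.assign(Column2=df1["Column2"].str.split(","))  # strings -> lists
    .explode("Column2")                                # one row per value
    .rename_axis("row")                                # name the original row index
    .reset_index()
    .drop_duplicates(["row", "Column2"])               # count a value once per row
)

# Number of original rows in each group that contain each value
counts = (
    exploded.groupby(["Column1", "Column2"])
    .size()
    .reset_index(name="n_rows")
)
print(counts)
```

For group a, for instance, the value 1 appears in two rows and every other value in one.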

Upvotes: 2
