Reputation: 5385
I have the following problem: I want to create a column in a dataframe summarizing all values in a row. Then I want to compare the rows of that column to create a single row containg all the values from all columns, but so that each value is only present a single time. As example: I have the following data frame
df1:
Column1 Column2
0 a 1,2,3
1 a 1,4,5
2 b 7,1,5
3 c 8,9
4 b 7,3,5
the desired output would now be:
df1_new:
Column1 Column2
0 a 1,2,3,4,5
1 b 1,3,5,7
2 c 8,9
What I am currently trying is result = df1.groupby('Column1')
, but then I don't know how to compare the values in the rows of the grouped objects and then write them to the new column and removing the duplicates. I read through the pandas documentation of Group By: split-apply-combine but could not figure out a way to do it. I also wonder if, once I have my desired output, there is a way to check in how many of the lines in the grouped object each value in Column2 of df1_new appeared. Any help on this would be greatly appreciated!
Upvotes: 3
Views: 2426
Reputation: 25189
What about this:
df1
Column1 Column2
0 a 1,2,3
1 a 1,4,5
2 b 7,1,5
3 c 8,9
4 b 7,3,5
df1.groupby('Column1').\
agg(lambda x: ','.join(x).split(','))['Column2'].\
apply(lambda x: ','.join(np.unique(x))).reset_index()
Column1 Column2
0 a 1,2,3,4,5
1 b 1,3,5,7
2 c 8,9
Upvotes: 1
Reputation: 90899
A method by which you can do this would be to apply a function on the grouped DataFrame.
This function would first convert the series (for each group) to a list, and then in the list split each string using ,
and then chain the complete list into a single list using itertools.chain.from_iterable
and then convert that to set
so that only unique values are left and then sort it and then convert back to string using str.join
. Example -
from itertools import chain
def applyfunc(x):
ch = chain.from_iterable(y.split(',') for y in x.tolist())
return ','.join(sorted(set(ch)))
df1_new = df1.groupby('Column1')['Column2'].apply(func1).reset_index()
Demo -
In [46]: df
Out[46]:
Column1 Column2
0 a 1,2,3
1 a 1,4,5
2 b 7,1,5
3 c 8,9
4 b 7,3,5
In [47]: from itertools import chain
In [48]: def applyfunc(x):
....: ch = chain.from_iterable(y.split(',') for y in x.tolist())
....: return ','.join(sorted(set(ch)))
....:
In [49]: df.groupby('Column1')['Column2'].apply(func1).reset_index()
Out[49]:
Column1 Column2
0 a 1,2,3,4,5
1 b 1,3,5,7
2 c 8,9
Upvotes: 2