Reputation: 13
I would like to know how we can get the unique combination of two column values when the same pair of values appears in either order. Below is the dataframe.
I tried the code below, but the output is different from what I expect:
df.groupby(['column1', 'column2'], as_index = False).agg({'expense' : 'sum'})
Upvotes: 1
Views: 3235
Reputation: 122
This is a variant of this question, but an important distinction is that it seems you don't care about the order of column1 or column2. Before I share the solution, here's the pseudocode:
1. Create an id column which we can use to find rows where the sets of column1 and column2 are the same.
2. Sum the salary within each id into an expense column, then drop the duplicate rows and the helper id column.
Here's my manual transcription of the data. In the future, please provide the sample data as text, instead of as a screenshot.
column1,column2,salary
ram,shyam,100
sita,geeta,500
geeta,sita,300
shyam,ram,600
sohan,mohan,200
mohan,sohan,400
And here's the code
>>> import pandas as pd
>>> df = pd.read_csv('data.csv')
>>> hash_func = lambda n: hash("-".join(sorted(n)))
>>> df['id'] = df[['column1','column2']].apply(hash_func, axis=1)
>>> df
column1 column2 salary id
0 ram shyam 100 -1387604912582040812
1 sita geeta 500 9030593041392264307
2 geeta sita 300 9030593041392264307
3 shyam ram 600 -1387604912582040812
4 sohan mohan 200 6327789560655124249
5 mohan sohan 400 6327789560655124249
>>> df['expense'] = df.groupby('id')['salary'].transform('sum')
>>> df
column1 column2 salary id expense
0 ram shyam 100 -1387604912582040812 700
1 sita geeta 500 9030593041392264307 800
2 geeta sita 300 9030593041392264307 800
3 shyam ram 600 -1387604912582040812 700
4 sohan mohan 200 6327789560655124249 600
5 mohan sohan 400 6327789560655124249 600
>>> df = df.drop_duplicates(subset=['id'])
>>> df
column1 column2 salary id expense
0 ram shyam 100 -1387604912582040812 700
1 sita geeta 500 9030593041392264307 800
4 sohan mohan 200 6327789560655124249 600
>>> df = df.drop(columns=['id','salary']) # some extra cleanup
>>> df
column1 column2 expense
0 ram shyam 700
1 sita geeta 800
4 sohan mohan 600
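One caveat about the id above: Python salts hash() for strings per interpreter session, so those id values change from run to run. If a reproducible key matters, a small variant of the same idea (a sketch, assuming the same data.csv) is to use the sorted, joined pair itself as the id:
>>> import pandas as pd
>>> df = pd.read_csv('data.csv')
>>> # use the sorted, joined pair itself as a stable, human-readable key
>>> df['id'] = df[['column1', 'column2']].apply(lambda n: '-'.join(sorted(n)), axis=1)
>>> df['expense'] = df.groupby('id')['salary'].transform('sum')
>>> df.drop_duplicates(subset=['id']).drop(columns=['id', 'salary'])
  column1 column2  expense
0     ram   shyam      700
1    sita   geeta      800
4   sohan   mohan      600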
Upvotes: 1
Reputation: 1337
You can sort the first and second columns so that a,b and b,a are treated as the same pair in groupby.
Since DataFrame.sort() is deprecated, we can use numpy's sort and re-create a new dataframe.
Assuming the following csvfile.csv:
column1,column2,salary
a,b,1
c,b,3
b,a,10
b,c,30
d,e,99
We can do it as follows:
import pandas as pd
import numpy as np
df = pd.read_csv("csvfile.csv",)
print("Original:\n ",df.head())
print ("\nGrouped sum:\n")
print(pd.concat([pd.DataFrame(np.sort(df[df.columns[:2]], axis=1), columns=df.columns[:2]),
                 df["salary"]], axis=1)
      .reset_index(drop=True)
      .groupby(["column1", "column2"]).sum())
The output is shown below:
Original:
column1 column2 salary
0 a b 1
1 c b 3
2 b a 10
3 b c 30
4 d e 99
Grouped sum:
salary
column1 column2
a b 11
b c 33
d e 99
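For readability, the same row-wise sort can also be written back into the two name columns before grouping. Here is a sketch under the same csvfile.csv assumption:
import pandas as pd
import numpy as np

df = pd.read_csv("csvfile.csv")
# sort the two names within each row so (a, b) and (b, a) become the same pair
df[["column1", "column2"]] = np.sort(df[["column1", "column2"]].values, axis=1)
print(df.groupby(["column1", "column2"], as_index=False)["salary"].sum())
This prints the same grouped sums as above, with column1 and column2 kept as regular columns rather than an index.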
Upvotes: 0
Reputation: 462
I followed these steps. First, build a pairs column:
df['pairs'] = df['col1'] + '-' + df['col2']
Then apply a foo function to this column. The idea of this function is to take the pairs column data and sort the two names based on the first character of each element in the pair. For example, whether the input is ram-shyam or shyam-ram, the output will be ram-shyam.
This is the foo function:
def foo(s):
    # split "name1-name2" into its two names
    lst_s = s.split('-')
    # map each position to the first character of that name
    temp = {}
    for idx, name in enumerate(lst_s):
        temp[idx] = name[0]
    # order the positions by that first character
    temp = dict(sorted(temp.items(), key=lambda item: item[1]))
    # rebuild the pair in the sorted order
    final = []
    for key in temp.keys():
        final.append(lst_s[key])
    return '-'.join(final)
Now apply this function on the pairs column:
df['unique-pair'] = df['pairs'].apply(foo)
The output looks now like this -
col1 col2 salary unique-pair pairs
0 ram shyam 100 ram-shyam ram-shyam
1 sita gita 500 gita-sita sita-gita
2 gita sita 300 gita-sita gita-sita
3 shyam ram 600 ram-shyam shyam-ram
4 sohan mohan 200 mohan-sohan sohan-mohan
5 mohan sohan 400 mohan-sohan mohan-sohan
Now you can do a group by:
df.groupby(['unique-pair']).agg({'salary': 'sum'})
The final output is:
salary
unique-pair
gita-sita 800
mohan-sohan 600
ram-shyam 700
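One thing to be aware of: foo orders the pair by the first character of each name only, so two names sharing a first letter (say ram and ravi) would still produce two different keys. Sorting on the full names avoids that; a small sketch, assuming the same col1 and col2 columns:
df['unique-pair'] = df[['col1', 'col2']].apply(lambda r: '-'.join(sorted(r)), axis=1)
df.groupby(['unique-pair'], as_index=False).agg({'salary': 'sum'})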
Upvotes: 0