Reputation: 162
I am looking for a solution to removing duplicate column indexes in my dataframe- what I need to do is to add the values in the duplicate columns row by row and then keep only 1 of these columns with the summed value
df = pd.DataFrame(np.array([[0,0,0,1,0,0,0], [0,1,0,0,0,0,0],
[0,0,0,0,0,0,1]]), columns=[1,1,2,2,2,3,3], index=[1,2,3])
1 1 2 2 2 3 3
1 0 0 0 1 0 0 0
2 0 1 0 0 0 0 0
3 0 0 0 0 0 0 1
should become
1 2 3
1 0 1 0
2 1 0 0
3 0 0 1
Upvotes: 4
Views: 1416
Reputation: 323226
you do not need groupby
here
df.sum(level=0,axis=1)
Out[358]:
1 2 3
1 0 1 0
2 1 0 0
3 0 0 1
Upvotes: 2
Reputation: 51165
Simply group by columns:
df.groupby(df.columns, 1).sum()
1 2 3
1 0 1 0
2 1 0 0
3 0 0 1
Or as pointed out by @user2285236
df.groupby(axis=1, level=0).sum()
Upvotes: 1
Reputation: 39042
Due to the missing data, an ugly attempt to your problem:
import pandas as pd
df = pd.DataFrame(np.array([[0,0,0,1,0,0,0], [0,1,0,0,0,0,0],
[0,0,0,0,0,0,1]]))
df.columns = [1,1,2,2,2,3,3]
df1 = df.groupby(lambda x:x, axis=1).sum()
df1.index = range(1,4)
df1
outputs the desired dataframe you posted. The following df1.index = range(1,4)
is just to re-index the rows because they start with 1
in your example.
Upvotes: 1
Reputation: 15568
Have you tried?
df = df.loc[:,~df.columns.duplicated(keep='last')]
Upvotes: 0