Hamperfait
Hamperfait

Reputation: 515

pandas append duplicates as columns

I have a df that looks like this

         ID       data1     data2
 index
 1       1        3         4
 2       1        2         5
 3       2        9         3
 4       3        7         2
 5       3        4         7
 6       1        10        12

What I'm trying to do is append as columns all the lines that have the same ID, so that I'd get something like this

         ID       data2     data3    data4    data5    data6     data7
 index
 1       1        3         4        2        5        10        12        
 3       2        9         3
 4       3        7         2        4        7

The problem is that I don't know how many columns I will have to append. The column. Note that ID is NOT an index but a normal column, but the one that is used to find the duplicates. I have already tried with pd.concat(), but had no luck.

Upvotes: 1

Views: 1361

Answers (1)

jezrael
jezrael

Reputation: 862511

You can use cumcount for count duplicates with set_index + unstack for reshaping. Then convert MultiIndex to columns by map and last reset_index for column ID from index.

df['g'] = df.groupby('ID').cumcount().astype(str)
df = df.set_index(['ID','g']).unstack().sort_index(axis=1, level=1)
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
   ID  data1_0  data2_0  data1_1  data2_1  data1_2  data2_2
0   1      3.0      4.0      2.0      5.0     10.0     12.0
1   2      9.0      3.0      NaN      NaN      NaN      NaN
2   3      7.0      2.0      4.0      7.0      NaN      NaN

Solution with pivot:

df['g'] = df.groupby('ID').cumcount().astype(str)
df = df.pivot(index='ID',columns='g').sort_index(axis=1, level=1)
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
   ID  data1_0  data2_0  data1_1  data2_1  data1_2  data2_2
0   1      3.0      4.0      2.0      5.0     10.0     12.0
1   2      9.0      3.0      NaN      NaN      NaN      NaN
2   3      7.0      2.0      4.0      7.0      NaN      NaN

Another solution with apply and DataFrame constructor:

df = df.groupby('ID')['data1','data2']
       .apply(lambda x: pd.DataFrame(x.values, columns=['a','b']))
       .unstack()
       .sort_index(axis=1, level=1)
df.columns = df.columns.map('{0[0]}_{0[1]}'.format)
df = df.reset_index()
print (df)
   ID  a_0  b_0  a_1  b_1   a_2   b_2
0   1  3.0  4.0  2.0  5.0  10.0  12.0
1   2  9.0  3.0  NaN  NaN   NaN   NaN
2   3  7.0  2.0  4.0  7.0   NaN   NaN

Upvotes: 1

Related Questions