Use DataFrame column as index and append duplicates as new columns

Question

I have a DataFrame that contains a column with dates which I'd like to use as my DataFrame's index. The dates in that column are not necessarily unique - sometimes there might be duplicates. I wish to append duplicates as new columns. Dates that are unique can just contain NaN (or whatever) for the newly appended columns.

To clarify I'll provide an example:

import pandas as pd

data = [
        {'Date':'01-01-2020','A':0,'B':1},
        {'Date':'01-01-2020','A':2,'B':3},
        {'Date':'02-01-2020','A':4,'B':5},
        {'Date':'02-01-2020','A':6,'B':7},
        {'Date':'03-01-2020','A':8,'B':9},
]

df = pd.DataFrame(data)

This will yield:

    Date        A   B
0   01-01-2020  0   1
1   01-01-2020  2   3
2   02-01-2020  4   5
3   02-01-2020  6   7
4   03-01-2020  8   9

What I want:

            A   B   C   D
Date        
01-01-2020  0   1   2   3
02-01-2020  4   5   6   7
03-01-2020  8   9   NaN NaN

The naming of the newly appended columns can be arbitrary. I don't even know whether appending would be the right way to go about it. Maybe it's easier to create a new DataFrame from scratch.

jezrael · Accepted Answer

Use DataFrame.set_index with DataFrame.stack for unpivot data and then pivoting by GroupBy.cumcount and Series.unstack:

df1 = df.set_index('Date').stack().reset_index(name='val')

df = df1.set_index(['Date', df1.groupby('Date').cumcount()])['val'].unstack()
print (df)
              0    1    2    3
Date                          
01-01-2020  0.0  1.0  2.0  3.0
02-01-2020  4.0  5.0  6.0  7.0
03-01-2020  8.0  9.0  NaN  NaN

If order of output values is not important:

df1 = df.melt('Date')

df = df1.set_index(['Date', df1.groupby('Date').cumcount()])['value'].unstack()
print (df)
              0    1    2    3
Date                          
01-01-2020  0.0  2.0  1.0  3.0
02-01-2020  4.0  6.0  5.0  7.0
03-01-2020  8.0  9.0  NaN  NaN

Another idea is use lambda function for reshape:

df = (df.set_index('Date')
       .groupby(level=0)
       .apply(lambda x: pd.Series(x.to_numpy().ravel()))
       .unstack())

print (df)
              0    1    2    3
Date                          
01-01-2020  0.0  1.0  2.0  3.0
02-01-2020  4.0  5.0  6.0  7.0
03-01-2020  8.0  9.0  NaN  NaN

Use DataFrame column as index and append duplicates as new columns

Answers (1)

Related Questions