jski212002
jski212002

Reputation: 93

Splitting a row based on two column values

I have a Pandas DataFrame like this:

    A   B   C   Date1       Date2
0   a   b   c   02/19/2022  02/20/2022
1  aa  bb  cc   02/26/2022  02/27/2022

I want to combine Date1 and Date2 into one column, and duplicate the other column values that correspond to it. Into a dataframe that looks like this:

    A   B   C   Date
0   a   b   c   02/19/2022
1   a   b   c   02/20/2022
2  aa  bb  cc   02/26/2022
3  aa  bb  cc   02/27/2022

I know how one would do this once this dataframe has been imported into SQL using join, but is there a way to do this directly in the Pandas dataframe before uploading to a database?

Upvotes: 2

Views: 104

Answers (1)

sophocles
sophocles

Reputation: 13821

Yes, pandas offers stack() - You could set the index on your 'A', 'B' and 'C' columns:

df.set_index(['A','B','C']).stack().droplevel(level=3).rename('Date').reset_index()

prints:

    A   B   C        Date
0   a   b   c  02/19/2022
1   a   b   c  02/20/2022
2  aa  bb  cc  02/26/2022
3  aa  bb  cc  02/27/2022

Breakdown:

# Setup
>>> df.to_dict()

{'A': {0: 'a', 1: 'aa'},
 'B': {0: 'b', 1: 'bb'},
 'C': {0: 'c', 1: 'cc'},
 'Date1': {0: '02/19/2022', 1: '02/26/2022'},
 'Date2': {0: '02/20/2022', 1: '02/27/2022'}}

>>> print(df)
    A   B   C       Date1       Date2
0   a   b   c  02/19/2022  02/20/2022
1  aa  bb  cc  02/26/2022  02/27/2022

# 1) Set the index -> A,B,C
a = df.set_index(['A','B','C'])

                Date
A  B  C             
a  b  c   02/19/2022
      c   02/20/2022
aa bb cc  02/26/2022
      cc  02/27/2022

# 2) Stack the DF -> Since it has one column, it will return a Series
b = a.stack()

A   B   C       
a   b   c   Date    02/19/2022
            Date    02/20/2022
aa  bb  cc  Date    02/26/2022
            Date    02/27/2022

# 3) Remove the 3rd layer of the index
# Where it says "Date", which is the result of 'Stacking'
# Try dropping another level here
c = b.droplevel(level=3)

A   B   C 
a   b   c     02/19/2022
        c     02/20/2022
aa  bb  cc    02/26/2022
        cc    02/27/2022

# Renaming the three layered series
d = c.rename('Date')

A   B   C 
a   b   c     02/19/2022
        c     02/20/2022
aa  bb  cc    02/26/2022
        cc    02/27/2022


# Converting the index to columns
e = d.reset_index()

    A   B   C        Date
0   a   b   c  02/19/2022
1   a   b   c  02/20/2022
2  aa  bb  cc  02/26/2022
3  aa  bb  cc  02/27/2022

Documentation of the functions used:

Upvotes: 2

Related Questions