bjornvandijkman

Reputation: 304

Split into different rows based on condition

I have a dataframe that looks as follows:

import pandas as pd

data = [
    [101, '1987-09-01', 1, 1, '1987-09-01', 2, 2],
    [102, '1987-09-01', 1, 1, '1999-09-01', 2, 2],
    [103, 'nan', 0, 0, '1999-09-01', 2, 2]
]
df = pd.DataFrame(data, columns=['ID', 'Date1', 'x1', 'y1', 'Date2', 'x2', 'y2'])
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])

My goal

If the date values in a row are the same, aggregate the x and y values. If they are not the same, split the row into two rows and keep the values as they are.

Explained in (pseudo) code:

for name in df.columns:
    if 'Date' in name:
        for index, row in df.iterrows():
            print(row[name])

            # Compare the values of the dates to see if they are equal
            if date1 == date2:
                # Sum the values of x1 and x2, and the values of y1 and y2
                ...
            else:
                # Group by date: create two separate rows and do not sum x and y
                ...

An additional challenge is that there can be fewer or more than two columns containing a date, though the column name will always contain the string 'Date'. If there are, for example, three date columns with three different values, the goal is to create three rows (a hypothetical example is sketched below). If there is only one date column, no modification is needed.
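
For illustration, a hypothetical three-date-column row could look like this (the column names and values are invented for the example):

# Hypothetical three-date-column input (illustrative only):
row3 = [[201, '1987-09-01', 1, 1, '1990-01-01', 2, 2, '1999-09-01', 3, 3]]
df3 = pd.DataFrame(row3, columns=['ID', 'Date1', 'x1', 'y1',
                                  'Date2', 'x2', 'y2',
                                  'Date3', 'x3', 'y3'])
# Expected result: three rows for ID 201, one per distinct date,
# with the x and y values kept as they are.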

Desired outcome

desired_outcome = [[101, '1987-09-01', 3, 3], [102, '1987-09-01', 1, 1], [102, '1999-09-01', 2, 2], [103, '1999-09-01', 2, 2]]
df_desired_outcome = pd.DataFrame(desired_outcome, columns=['ID', 'Date', 'x', 'y'])

Upvotes: 1

Views: 154

Answers (1)

jezrael

Reputation: 863166

Use wide_to_long to reshape first, and then aggregate with sum:

df1 = pd.wide_to_long(df.reset_index(), 
                     stubnames=['Date','x','y'], 
                     i=['index','ID'], 
                     j='tmp')

df1 = df1.groupby(['index','ID','Date']).sum().reset_index(level=0, drop=True).reset_index()
print (df1)
    ID        Date  x  y
0  101  1987-09-01  3  3
1  102  1987-09-01  1  1
2  102  1999-09-01  2  2
3  103  1999-09-01  2  2
4  103         nan  0  0
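
The printout above shows a 'nan' row, which happens when the dates are left as plain strings; with the question's to_datetime conversion applied, 'nan' becomes NaT and groupby drops that key by default, matching the desired outcome. A minimal sketch of the datetime variant (assuming the question's setup):

# 'nan' parses to NaT, and groupby excludes NaT keys by default,
# so the 103/NaT half-row disappears from the result.
df['Date1'] = pd.to_datetime(df['Date1'], errors='coerce')
df['Date2'] = pd.to_datetime(df['Date2'], errors='coerce')

df1 = pd.wide_to_long(df.reset_index(),
                      stubnames=['Date', 'x', 'y'],
                      i=['index', 'ID'],
                      j='tmp')
df1 = (df1.groupby(['index', 'ID', 'Date'])   # NaT keys dropped by default
          .sum()
          .reset_index(level=0, drop=True)
          .reset_index())
# To keep the NaT rows instead, pandas >= 1.1 supports groupby(..., dropna=False).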

If the ID values are unique, the solution can be simplified:

df1 = pd.wide_to_long(df, 
                     stubnames=['Date','x','y'], 
                     i='ID', 
                     j='tmp')

df1 = df1.groupby(['ID','Date']).sum().reset_index()
print (df1)
    ID        Date  x  y
0  101  1987-09-01  3  3
1  102  1987-09-01  1  1
2  102  1999-09-01  2  2
3  103  1999-09-01  2  2
4  103         nan  0  0
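
If the dates are left as strings, as in the printout above, the leftover 'nan' row can be filtered out explicitly to match desired_outcome; a minimal sketch:

# Drop placeholder 'nan' dates (assumes the dates are still strings).
df1 = df1[df1['Date'] != 'nan'].reset_index(drop=True)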

EDIT:

If the column names do not end with 1, 2 the way the date columns do, you can normalize them by their first 2 letters and then apply the solution above with the changed stubnames (a sketch of that step follows the rename below):

data = [
    [101, '1987-09-01', 1, 1, '1987-09-01', 2, 2],
    [102, '1987-09-01', 1, 1, '1999-09-01', 2, 2],
    [103, 'nan', 0, 0, '1999-09-01', 2, 2]
]
df = pd.DataFrame(data, columns=['ID', 'Date1', 'OPxx', 'NPxy', 
                                 'Date2', 'OPyy', 'NPyx'])

s = df.columns.to_series()
m = s.str.startswith(('ID','Date'))
s1 = s[~m].str[:2]
s2 = s1.groupby(s1).cumcount().add(1).astype(str)

s[~m] = s1 + s2
print (s)
ID          ID
Date1    Date1
OPxx       OP1
NPxy       NP1
Date2    Date2
OPyy       OP2
NPyx       NP2
dtype: object

df = df.rename(columns=s)
print (df)
    ID       Date1  OP1  NP1       Date2  OP2  NP2
0  101  1987-09-01    1    1  1987-09-01    2    2
1  102  1987-09-01    1    1  1999-09-01    2    2
2  103         nan    0    0  1999-09-01    2    2
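
With the columns renamed so that each stub ends with 1, 2, the solution above can be applied with the changed stubnames; a minimal sketch, assuming the rename shown above:

df1 = pd.wide_to_long(df.reset_index(),
                      stubnames=['Date', 'OP', 'NP'],
                      i=['index', 'ID'],
                      j='tmp')
df1 = (df1.groupby(['index', 'ID', 'Date'])
          .sum()
          .reset_index(level=0, drop=True)
          .reset_index())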

EDIT2: An attempt at a more general solution:

data = [
    [101, '1987-09-01', 1, 1, '1987-09-01', 2, 2, 3],
    [102, '1987-09-01', 1, 1, '1999-09-01', 2, 2, 3],
    [103, 'nan', 0, 0, '1999-09-01', 2, 2, 3]
]
df = pd.DataFrame(data, columns=['ID', 'Date1', 'OPxx', 'NPxy', 'Date2',
                                 'OPyy', 'NPyx', 'WZ'])
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])


s = df.columns.to_series()

# get the first 2 characters of each column name
s1 = s.str[:2]
# start a new group at each ID or Da (first 2 letters of Date) column
s2 = s1.isin(['ID','Da']).cumsum().astype(str)

s = s1 + s2
print (s)
ID       ID1
Date1    Da2
OPxx     OP2
NPxy     NP2
Date2    Da3
OPyy     OP3
NPyx     NP3
WZ       WZ3
dtype: object
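
To see why the cumsum trick forms the blocks: isin marks the start of each block (the ID column and every Date column), and the running cumsum assigns the same block number to every column until the next start. A tiny standalone illustration (values copied from the columns above):

s1 = pd.Series(['ID', 'Da', 'OP', 'NP', 'Da', 'OP', 'NP', 'WZ'])
print(s1.isin(['ID', 'Da']).cumsum().tolist())
# [1, 2, 2, 2, 3, 3, 3, 3]  -> one block number per column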

df = df.rename(columns=s)
print (df)
   ID1        Da2  OP2  NP2        Da3  OP3  NP3  WZ3
0  101 1987-09-01    1    1 1987-09-01    2    2    3
1  102 1987-09-01    1    1 1999-09-01    2    2    3
2  103        NaT    0    0 1999-09-01    2    2    3

Then create the stubnames dynamically: all unique values of s1, excluding ID and index:

import numpy as np

print(np.setdiff1d(s1.unique(), ['ID', 'index']))
['Da' 'NP' 'OP' 'WZ']

df1 = pd.wide_to_long(df.reset_index(), 
                     stubnames=np.setdiff1d(s1.unique(), ['ID', 'index']), 
                     i=['index','ID1'], 
                     j='tmp')

Then aggregate with sum:

df2 = (df1.groupby(['index','ID1','Da'])
          .sum()
          .reset_index(level=0, drop=True)
          .reset_index())
print (df2)
   ID1         Da  NP  OP   WZ
0  101 1987-09-01   3   3  3.0
1  102 1987-09-01   1   1  0.0
2  102 1999-09-01   2   2  3.0
3  103 1999-09-01   2   2  3.0
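
The WZ column comes out as float because there is no WZ column in the first date block, so wide_to_long fills that stub with NaN and sum() counts it as 0 (hence the 0.0 in ID 102's 1987 row). Optionally, the normalized labels can be mapped back to readable names; a minimal sketch (target names chosen to mirror the question's desired_outcome):

# Hypothetical cleanup: restore readable column names.
df2 = df2.rename(columns={'ID1': 'ID', 'Da': 'Date'})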

Upvotes: 2
