Reputation: 304
I have a dataframe that looks as follows:
import pandas as pd

data = [
    [101, '1987-09-01', 1, 1, '1987-09-01', 2, 2],
    [102, '1987-09-01', 1, 1, '1999-09-01', 2, 2],
    [103, 'nan', 0, 0, '1999-09-01', 2, 2]
]
df = pd.DataFrame(data, columns=['ID', 'Date1', 'x1', 'y1', 'Date2', 'x2', 'y2'])
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])
If the date values in a row are the same, aggregate (sum) the x and y values. If they are not the same, split the row into two rows and keep the values as they are.
Explained in (pseudo) code:
for name in df.columns:
    if 'Date' in name:
        for index, row in df.iterrows():
            print(row[name])
            # Compare the date values in this row. See if they are equal
            if date1 == date2:
                # Sum the values of x1 and x2, and sum the values of y1 and y2
            else:
                # Group by date. Create two separate rows and do not sum the values of x and y.
An additional challenge is that there can be fewer or more than 2 date columns. The column name will always contain the string 'Date', though. If there are, for example, three date columns with three different values, then the goal is to create three rows. If there is only 1 date column, then no modifications are needed.
desired_outcome = [[101, '1987-09-01', 3, 3],
                   [102, '1987-09-01', 1, 1],
                   [102, '1999-09-01', 2, 2],
                   [103, '1999-09-01', 2, 2]]
df_desired_outcome = pd.DataFrame(desired_outcome, columns=['ID', 'Date', 'x', 'y'])
Upvotes: 1
Views: 154
Reputation: 863166
Use wide_to_long to reshape first and then aggregate with sum:
df1 = pd.wide_to_long(df.reset_index(),
                      stubnames=['Date', 'x', 'y'],
                      i=['index', 'ID'],
                      j='tmp')
df1 = df1.groupby(['index','ID','Date']).sum().reset_index(level=0, drop=True).reset_index()
print (df1)
ID Date x y
0 101 1987-09-01 3 3
1 102 1987-09-01 1 1
2 102 1999-09-01 2 2
3 103 1999-09-01 2 2
4 103 nan 0 0
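Note the row with the missing date survives here, while desired_outcome has no such row. If it should be dropped, one possible follow-up (assuming the date columns were parsed with to_datetime, so the missing value is NaT) is:

df1 = df1[df1['Date'].notna()]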
If ID values are unique, the solution can be simplified:
df1 = pd.wide_to_long(df,
                      stubnames=['Date', 'x', 'y'],
                      i='ID',
                      j='tmp')
df1 = df1.groupby(['ID','Date']).sum().reset_index()
print (df1)
ID Date x y
0 101 1987-09-01 3 3
1 102 1987-09-01 1 1
2 102 1999-09-01 2 2
3 103 1999-09-01 2 2
4 103 nan 0 0
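The reset_index in the first variant exists because wide_to_long requires the i columns to uniquely identify each row; the temporary index column guarantees that when ID values repeat, and it can be dropped once the IDs are known to be unique.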
EDIT:
If the column names do not end with 1, 2 the way the date columns do, you can normalize them by their first 2 letters and then apply the solution above (only the stubnames change):
data = [
    [101, '1987-09-01', 1, 1, '1987-09-01', 2, 2],
    [102, '1987-09-01', 1, 1, '1999-09-01', 2, 2],
    [103, 'nan', 0, 0, '1999-09-01', 2, 2]
]
df = pd.DataFrame(data, columns=['ID', 'Date1', 'OPxx', 'NPxy',
                                 'Date2', 'OPyy', 'NPyx'])
s = df.columns.to_series()
# mask of columns that should keep their names
m = s.str.startswith(('ID', 'Date'))
# first 2 characters of the remaining columns
s1 = s[~m].str[:2]
# per-prefix counter, so repeated prefixes get suffixes 1, 2, ...
s2 = s1.groupby(s1).cumcount().add(1).astype(str)
s[~m] = s1 + s2
print (s)
ID ID
Date1 Date1
OPxx OP1
NPxy NP1
Date2 Date2
OPyy OP2
NPyx NP2
dtype: object
df = df.rename(columns=s)
print (df)
ID Date1 OP1 NP1 Date2 OP2 NP2
0 101 1987-09-01 1 1 1987-09-01 2 2
1 102 1987-09-01 1 1 1999-09-01 2 2
2 103 nan 0 0 1999-09-01 2 2
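With the normalized names, the reshape is the same as in the first solution, only with the new stubnames; a minimal sketch:

df1 = pd.wide_to_long(df.reset_index(),
                      stubnames=['Date', 'OP', 'NP'],
                      i=['index', 'ID'],
                      j='tmp')
df1 = (df1.groupby(['index', 'ID', 'Date'])
          .sum()
          .reset_index(level=0, drop=True)
          .reset_index())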
EDIT2: An attempt at a more general solution:
data = [
    [101, '1987-09-01', 1, 1, '1987-09-01', 2, 2, 3],
    [102, '1987-09-01', 1, 1, '1999-09-01', 2, 2, 3],
    [103, 'nan', 0, 0, '1999-09-01', 2, 2, 3]
]
df = pd.DataFrame(data, columns=['ID', 'Date1', 'OPxx', 'NPxy', 'Date2',
                                 'OPyy', 'NPyx', 'WZ'])
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])
s = df.columns.to_series()
# get first 2 characters
s1 = s.str[:2]
# create groups starting at ID and at each Da (first 2 letters of Date)
s2 = s1.isin(['ID', 'Da']).cumsum().astype(str)
s = s1 + s2
print (s)
ID ID1
Date1 Da2
OPxx OP2
NPxy NP2
Date2 Da3
OPyy OP3
NPyx NP3
WZ WZ3
dtype: object
df = df.rename(columns=s)
print (df)
ID1 Da2 OP2 NP2 Da3 OP3 NP3 WZ3
0 101 1987-09-01 1 1 1987-09-01 2 2 3
1 102 1987-09-01 1 1 1999-09-01 2 2 3
2 103 NaT 0 0 1999-09-01 2 2 3
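Each Date column starts a new group (the cumsum increments at ID and at every Da prefix), so a trailing column like WZ is suffixed with the number of the date group it belongs to.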
Then create the stubnames dynamically: all unique values of s1, excluding ID and index:
import numpy as np

print(np.setdiff1d(s1.unique(), ['ID', 'index']))
['Da' 'NP' 'OP' 'WZ']
df1 = pd.wide_to_long(df.reset_index(),
                      stubnames=np.setdiff1d(s1.unique(), ['ID', 'index']),
                      i=['index', 'ID1'],
                      j='tmp')
Aggregate with sum:
df2 = (df1.groupby(['index', 'ID1', 'Da'])
          .sum()
          .reset_index(level=0, drop=True)
          .reset_index())
print (df2)
ID1 Da NP OP WZ
0 101 1987-09-01 3 3 3.0
1 102 1987-09-01 1 1 0.0
2 102 1999-09-01 2 2 3.0
3 103 1999-09-01 2 2 3.0
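Note that WZ appears only in the third group, so the long-format rows created from the second group get NaN for it; sum turns those into 0.0, which is why WZ ends up as float.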
Upvotes: 2