Reputation: 379
Trying to create new column with values that meet specific conditions. Below I have set out code which goes some way in explaining the logic but does not produce the correct output:
import pandas as pd
import numpy as np
df = pd.DataFrame({'date': ['2019-08-06 09:00:00', '2019-08-06 12:00:00', '2019-08-06 18:00:00', '2019-08-06 21:00:00', '2019-08-07 09:00:00', '2019-08-07 16:00:00', '2019-08-08 17:00:00' ,'2019-08-09 16:00:00'],
'type': [0, 1, np.nan, 1, np.nan, np.nan, 0 ,0],
'colour': ['blue', 'red', np.nan, 'blue', np.nan, np.nan, 'blue', 'red'],
'maxPixel': [255, 7346, 32, 5184, 600, 322, 72, 6000],
'minPixel': [86, 96, 14, 3540, 528, 300, 12, 4009],
'colourDate': ['2019-08-06 12:00:00', '2019-08-08 16:00:00', '2019-08-06 23:00:00', '2019-08-06 22:00:00', '2019-08-08 09:00:00', '2019-08-09 16:00:00', '2019-08-08 23:00:00' ,'2019-08-11 16:00:00'] })
max_conditions = [(df['type'] == 1) & (df['colour'] == 'blue'),
(df['type'] == 1) & (df['colour'] == 'red')]
max_choices = [np.where(df['date'] <= df['colourDate'], max(df['maxPixel']), np.nan),
np.where(df['date'] <= df['colourDate'], min(df['minPixel']), np.nan)]
df['pixelLimit'] = np.select(max_conditions, max_choices, default=np.nan)
Incorrect output:
date type colour maxPixel minPixel colourDate pixelLimit
0 2019-08-06 09:00:00 0.0 blue 255 86 2019-08-06 12:00:00 NaN
1 2019-08-06 12:00:00 1.0 red 7346 96 2019-08-08 16:00:00 12.0
2 2019-08-06 18:00:00 NaN NaN 32 14 2019-08-06 23:00:00 NaN
3 2019-08-06 21:00:00 1.0 blue 5184 3540 2019-08-06 22:00:00 6000.0
4 2019-08-07 09:00:00 NaN NaN 600 528 2019-08-08 09:00:00 NaN
5 2019-08-07 16:00:00 NaN NaN 322 300 2019-08-09 16:00:00 NaN
6 2019-08-08 17:00:00 0.0 blue 72 12 2019-08-08 23:00:00 NaN
7 2019-08-09 16:00:00 0.0 red 6000 4009 2019-08-11 16:00:00 NaN
Explanation why output is incorrect:
Value 12.0
in index row 1 for column df['pixelLimit']
is incorrect because this value is from df['minPixel']
index row 6 which has has a df['date']
datetime of 2019-08-08 17:00:00
which is greater than the 2019-08-08 16:00:00
df['date']
datetime contained in index row 1.
Value 6000.0
in index row 3 for column df['pixelLimit']
is incorrect because this value is from df['maxPixel']
index row 7 which has a df['date']
datetime of 2019-08-09 16:00:00
which is greater than the 2019-08-06 22:00:00
df['date']
datetime contained in index row .
Correct output:
date type colour maxPixel minPixel colourDate pixelLimit
0 2019-08-06 09:00:00 0.0 blue 255 86 2019-08-06 12:00:00 NaN
1 2019-08-06 12:00:00 1.0 red 7346 96 2019-08-08 16:00:00 14.0
2 2019-08-06 18:00:00 NaN NaN 32 14 2019-08-06 23:00:00 NaN
3 2019-08-06 21:00:00 1.0 blue 5184 3540 2019-08-06 22:00:00 5184.0
4 2019-08-07 09:00:00 NaN NaN 600 528 2019-08-08 09:00:00 NaN
5 2019-08-07 16:00:00 NaN NaN 322 300 2019-08-09 16:00:00 NaN
6 2019-08-08 17:00:00 0.0 blue 72 12 2019-08-08 23:00:00 NaN
7 2019-08-09 16:00:00 0.0 red 6000 4009 2019-08-11 16:00:00 NaN
Explanation why output is correct:
Value 14.0
in index row 1 for column df['pixelLimit']
is correct because we are looking for the minimum value in column df['minPixel']
which has a datetime in column df['date']
less than the datetime in index row 1 for column df['colourDate']
and greater or equal to the datetime in index row 1 for column df['date']
Value 5184.0
in index row 3 for column df['pixelLimit']
is correct because we are looking for the maximum value in column df['maxPixel']
which has a datetime in column df['date']
less than the datetime in index row 3 for column df['colourDate']
and greater or equal to the datetime in index row 3 for column df['date']
Considerations:
Maybe np.select
is not best suited for this task and some sort of function might serve the task better?
Also, maybe I need to create some sort of dynamic len
to use as a starting point for each row?
Request
Please can anyone out there help me amend my code to achieve the correct output
Upvotes: 4
Views: 598
Reputation: 59579
For matching problems like this one possibility is to do the complete merge, then subset, using a Boolean Series, to all rows that satisfy your condition (for that row) and find the max
or min
among all the possible matches. Since this requires slightly different columns and different functions I split the operations into 2 very similar pieces of code, one to deal with 1/blue and the other for 1/red.
First some housekeeping, make things datetime
import pandas as pd
df['date'] = pd.to_datetime(df['date'])
df['colourDate'] = pd.to_datetime(df['colourDate'])
Calculate the min pixel for 1/red between the times for each row
# Subset of rows we need to do this for
dfmin = df[df.type.eq(1) & df.colour.eq('red')].reset_index()
# To each row merge all rows from the original DataFrame
dfmin = dfmin.merge(df[['date', 'minPixel']], how='cross')
# If pd.version < 1.2 instead use:
#dfmin = dfmin.assign(t=1).merge(df[['date', 'minPixel']].assign(t=1), on='t')
# Only keep rows between the dates, then among those find the min minPixel
smin = (dfmin[dfmin.date_y.between(dfmin.date_x, dfmin.colourDate)]
.groupby('index')['minPixel_y'].min()
.rename('pixel_limit'))
#index
#1 14
#Name: pixel_limit, dtype: int64
# Max is basically a mirror
dfmax = df[df.type.eq(1) & df.colour.eq('blue')].reset_index()
dfmax = dfmax.merge(df[['date', 'maxPixel']], how='cross')
#dfmax = dfmax.assign(t=1).merge(df[['date', 'maxPixel']].assign(t=1), on='t')
smax = (dfmax[dfmax.date_y.between(dfmax.date_x, dfmax.colourDate)]
.groupby('index')['maxPixel_y'].max()
.rename('pixel_limit'))
Finally because the above groups over the original index (i.e. 'index'
) we can simply assign back to align with the original DataFrame.
df['pixel_limit'] = pd.concat([smin, smax])
date type colour maxPixel minPixel colourDate pixel_limit
0 2019-08-06 09:00:00 0.0 blue 255 86 2019-08-06 12:00:00 NaN
1 2019-08-06 12:00:00 1.0 red 7346 96 2019-08-08 16:00:00 14.0
2 2019-08-06 18:00:00 NaN NaN 32 14 2019-08-06 23:00:00 NaN
3 2019-08-06 21:00:00 1.0 blue 5184 3540 2019-08-06 22:00:00 5184.0
4 2019-08-07 09:00:00 NaN NaN 600 528 2019-08-08 09:00:00 NaN
5 2019-08-07 16:00:00 NaN NaN 322 300 2019-08-09 16:00:00 NaN
6 2019-08-08 17:00:00 0.0 blue 72 12 2019-08-08 23:00:00 NaN
7 2019-08-09 16:00:00 0.0 red 6000 4009 2019-08-11 16:00:00 NaN
If you need to bring along a lot of different information for the row with the min/max Pixel then instead of groupby
min/max
we will sort_values and then gropuby
+ head
or tail
to get the min or max pixel. For the min this would look like (slight renaming of suffixes):
# Subset of rows we need to do this for
dfmin = df[df.type.eq(1) & df.colour.eq('red')].reset_index()
# To each row merge all rows from the original DataFrame
dfmin = dfmin.merge(df[['date', 'minPixel']].reset_index(), how='cross',
suffixes=['', '_match'])
# For older pandas < 1.2
#dfmin = (dfmin.assign(t=1)
# .merge(df[['date', 'minPixel']].reset_index().assign(t=1),
# on='t', suffixes=['', '_match']))
# Only keep rows between the dates, then among those find the min minPixel row.
# A bunch of renaming.
smin = (dfmin[dfmin.date_match.between(dfmin.date, dfmin.colourDate)]
.sort_values('minPixel_match', ascending=True)
.groupby('index').head(1)
.set_index('index')
.filter(like='_match')
.rename(columns={'minPixel_match': 'pixel_limit'}))
The Max would then be similar using .tail
dfmax = df[df.type.eq(1) & df.colour.eq('blue')].reset_index()
dfmax = dfmax.merge(df[['date', 'maxPixel']].reset_index(), how='cross',
suffixes=['', '_match'])
smax = (dfmax[dfmax.date_match.between(dfmax.date, dfmin.colourDate)]
.sort_values('maxPixel_match', ascending=True)
.groupby('index').tail(1)
.set_index('index')
.filter(like='_match')
.rename(columns={'maxPixel_match': 'pixel_limit'}))
And finally we concat along axis=1
now that we need to join multiple columns to the original:
result = pd.concat([df, pd.concat([smin, smax])], axis=1)
date type colour maxPixel minPixel colourDate index_match date_match pixel_limit
0 2019-08-06 09:00:00 0.0 blue 255 86 2019-08-06 12:00:00 NaN NaN NaN
1 2019-08-06 12:00:00 1.0 red 7346 96 2019-08-08 16:00:00 2.0 2019-08-06 18:00:00 14.0
2 2019-08-06 18:00:00 NaN NaN 32 14 2019-08-06 23:00:00 NaN NaN NaN
3 2019-08-06 21:00:00 1.0 blue 5184 3540 2019-08-06 22:00:00 3.0 2019-08-06 21:00:00 5184.0
4 2019-08-07 09:00:00 NaN NaN 600 528 2019-08-08 09:00:00 NaN NaN NaN
5 2019-08-07 16:00:00 NaN NaN 322 300 2019-08-09 16:00:00 NaN NaN NaN
6 2019-08-08 17:00:00 0.0 blue 72 12 2019-08-08 23:00:00 NaN NaN NaN
7 2019-08-09 16:00:00 0.0 red 6000 4009 2019-08-11 16:00:00 NaN NaN NaN
Upvotes: 3