windwalker
windwalker

Reputation: 379

find specific value that meets conditions - python

Trying to create new column with values that meet specific conditions. Below I have set out code which goes some way in explaining the logic but does not produce the correct output:

import pandas as pd
import numpy as np


df = pd.DataFrame({'date': ['2019-08-06 09:00:00', '2019-08-06 12:00:00', '2019-08-06 18:00:00', '2019-08-06 21:00:00', '2019-08-07 09:00:00', '2019-08-07 16:00:00', '2019-08-08 17:00:00' ,'2019-08-09 16:00:00'], 
                'type': [0, 1, np.nan, 1, np.nan, np.nan, 0 ,0], 
                'colour': ['blue', 'red', np.nan, 'blue', np.nan, np.nan, 'blue', 'red'],
                'maxPixel': [255, 7346, 32, 5184, 600, 322, 72, 6000],
                'minPixel': [86, 96, 14, 3540, 528, 300, 12, 4009],
                'colourDate': ['2019-08-06 12:00:00', '2019-08-08 16:00:00', '2019-08-06 23:00:00', '2019-08-06 22:00:00', '2019-08-08 09:00:00', '2019-08-09 16:00:00', '2019-08-08 23:00:00' ,'2019-08-11 16:00:00'] })

max_conditions = [(df['type'] == 1) & (df['colour'] == 'blue'),
                  (df['type'] == 1) & (df['colour'] == 'red')]


max_choices = [np.where(df['date'] <= df['colourDate'], max(df['maxPixel']), np.nan),
                np.where(df['date'] <= df['colourDate'], min(df['minPixel']), np.nan)]


df['pixelLimit'] = np.select(max_conditions, max_choices, default=np.nan)

Incorrect output:

                  date  type colour  maxPixel  minPixel           colourDate  pixelLimit
0  2019-08-06 09:00:00   0.0   blue       255        86  2019-08-06 12:00:00         NaN
1  2019-08-06 12:00:00   1.0    red      7346        96  2019-08-08 16:00:00        12.0
2  2019-08-06 18:00:00   NaN    NaN        32        14  2019-08-06 23:00:00         NaN
3  2019-08-06 21:00:00   1.0   blue      5184      3540  2019-08-06 22:00:00      6000.0
4  2019-08-07 09:00:00   NaN    NaN       600       528  2019-08-08 09:00:00         NaN
5  2019-08-07 16:00:00   NaN    NaN       322       300  2019-08-09 16:00:00         NaN
6  2019-08-08 17:00:00   0.0   blue        72        12  2019-08-08 23:00:00         NaN
7  2019-08-09 16:00:00   0.0    red      6000      4009  2019-08-11 16:00:00         NaN

Explanation why output is incorrect:

Value 12.0 in index row 1 for column df['pixelLimit'] is incorrect because this value is from df['minPixel'] index row 6 which has has a df['date'] datetime of 2019-08-08 17:00:00 which is greater than the 2019-08-08 16:00:00 df['date'] datetime contained in index row 1.

Value 6000.0 in index row 3 for column df['pixelLimit'] is incorrect because this value is from df['maxPixel'] index row 7 which has a df['date'] datetime of 2019-08-09 16:00:00 which is greater than the 2019-08-06 22:00:00 df['date'] datetime contained in index row .

Correct output:

                  date  type colour  maxPixel  minPixel           colourDate  pixelLimit
0  2019-08-06 09:00:00   0.0   blue       255        86  2019-08-06 12:00:00         NaN
1  2019-08-06 12:00:00   1.0    red      7346        96  2019-08-08 16:00:00        14.0
2  2019-08-06 18:00:00   NaN    NaN        32        14  2019-08-06 23:00:00         NaN
3  2019-08-06 21:00:00   1.0   blue      5184      3540  2019-08-06 22:00:00      5184.0
4  2019-08-07 09:00:00   NaN    NaN       600       528  2019-08-08 09:00:00         NaN
5  2019-08-07 16:00:00   NaN    NaN       322       300  2019-08-09 16:00:00         NaN
6  2019-08-08 17:00:00   0.0   blue        72        12  2019-08-08 23:00:00         NaN
7  2019-08-09 16:00:00   0.0    red      6000      4009  2019-08-11 16:00:00         NaN

Explanation why output is correct:

Value 14.0 in index row 1 for column df['pixelLimit'] is correct because we are looking for the minimum value in column df['minPixel'] which has a datetime in column df['date'] less than the datetime in index row 1 for column df['colourDate'] and greater or equal to the datetime in index row 1 for column df['date']

Value 5184.0 in index row 3 for column df['pixelLimit'] is correct because we are looking for the maximum value in column df['maxPixel'] which has a datetime in column df['date'] less than the datetime in index row 3 for column df['colourDate'] and greater or equal to the datetime in index row 3 for column df['date']

Considerations:

Maybe np.select is not best suited for this task and some sort of function might serve the task better?

Also, maybe I need to create some sort of dynamic len to use as a starting point for each row?

Request

Please can anyone out there help me amend my code to achieve the correct output

Upvotes: 4

Views: 598

Answers (1)

ALollz
ALollz

Reputation: 59579

For matching problems like this one possibility is to do the complete merge, then subset, using a Boolean Series, to all rows that satisfy your condition (for that row) and find the max or min among all the possible matches. Since this requires slightly different columns and different functions I split the operations into 2 very similar pieces of code, one to deal with 1/blue and the other for 1/red.

First some housekeeping, make things datetime

import pandas as pd

df['date'] = pd.to_datetime(df['date'])
df['colourDate'] = pd.to_datetime(df['colourDate'])

Calculate the min pixel for 1/red between the times for each row

# Subset of rows we need to do this for
dfmin = df[df.type.eq(1) & df.colour.eq('red')].reset_index()

# To each row merge all rows from the original DataFrame
dfmin = dfmin.merge(df[['date', 'minPixel']], how='cross')
# If pd.version < 1.2 instead use: 
#dfmin = dfmin.assign(t=1).merge(df[['date', 'minPixel']].assign(t=1), on='t')

# Only keep rows between the dates, then among those find the min minPixel
smin = (dfmin[dfmin.date_y.between(dfmin.date_x, dfmin.colourDate)]
            .groupby('index')['minPixel_y'].min()
            .rename('pixel_limit'))
#index
#1    14
#Name: pixel_limit, dtype: int64

# Max is basically a mirror
dfmax = df[df.type.eq(1) & df.colour.eq('blue')].reset_index()

dfmax = dfmax.merge(df[['date', 'maxPixel']], how='cross')
#dfmax = dfmax.assign(t=1).merge(df[['date', 'maxPixel']].assign(t=1), on='t')

smax = (dfmax[dfmax.date_y.between(dfmax.date_x, dfmax.colourDate)]
           .groupby('index')['maxPixel_y'].max()
           .rename('pixel_limit'))

Finally because the above groups over the original index (i.e. 'index') we can simply assign back to align with the original DataFrame.

df['pixel_limit'] = pd.concat([smin, smax])

                 date  type colour  maxPixel  minPixel          colourDate  pixel_limit
0 2019-08-06 09:00:00   0.0   blue       255        86 2019-08-06 12:00:00          NaN
1 2019-08-06 12:00:00   1.0    red      7346        96 2019-08-08 16:00:00         14.0
2 2019-08-06 18:00:00   NaN    NaN        32        14 2019-08-06 23:00:00          NaN
3 2019-08-06 21:00:00   1.0   blue      5184      3540 2019-08-06 22:00:00       5184.0
4 2019-08-07 09:00:00   NaN    NaN       600       528 2019-08-08 09:00:00          NaN
5 2019-08-07 16:00:00   NaN    NaN       322       300 2019-08-09 16:00:00          NaN
6 2019-08-08 17:00:00   0.0   blue        72        12 2019-08-08 23:00:00          NaN
7 2019-08-09 16:00:00   0.0    red      6000      4009 2019-08-11 16:00:00          NaN

If you need to bring along a lot of different information for the row with the min/max Pixel then instead of groupby min/max we will sort_values and then gropuby + head or tail to get the min or max pixel. For the min this would look like (slight renaming of suffixes):

# Subset of rows we need to do this for
dfmin = df[df.type.eq(1) & df.colour.eq('red')].reset_index()

# To each row merge all rows from the original DataFrame
dfmin = dfmin.merge(df[['date', 'minPixel']].reset_index(), how='cross', 
                    suffixes=['', '_match'])
# For older pandas < 1.2
#dfmin = (dfmin.assign(t=1)
#              .merge(df[['date', 'minPixel']].reset_index().assign(t=1), 
#                     on='t', suffixes=['', '_match'])) 

# Only keep rows between the dates, then among those find the min minPixel row. 
# A bunch of renaming. 
smin = (dfmin[dfmin.date_match.between(dfmin.date, dfmin.colourDate)]
            .sort_values('minPixel_match', ascending=True)
            .groupby('index').head(1)
            .set_index('index')
            .filter(like='_match')
            .rename(columns={'minPixel_match': 'pixel_limit'}))

The Max would then be similar using .tail

dfmax = df[df.type.eq(1) & df.colour.eq('blue')].reset_index()
dfmax = dfmax.merge(df[['date', 'maxPixel']].reset_index(), how='cross', 
                    suffixes=['', '_match'])

smax = (dfmax[dfmax.date_match.between(dfmax.date, dfmin.colourDate)]
            .sort_values('maxPixel_match', ascending=True)
            .groupby('index').tail(1)
            .set_index('index')
            .filter(like='_match')
            .rename(columns={'maxPixel_match': 'pixel_limit'}))

And finally we concat along axis=1 now that we need to join multiple columns to the original:

result = pd.concat([df, pd.concat([smin, smax])], axis=1)

                  date  type colour  maxPixel  minPixel           colourDate  index_match           date_match  pixel_limit
0  2019-08-06 09:00:00   0.0   blue       255        86  2019-08-06 12:00:00          NaN                  NaN          NaN
1  2019-08-06 12:00:00   1.0    red      7346        96  2019-08-08 16:00:00          2.0  2019-08-06 18:00:00         14.0
2  2019-08-06 18:00:00   NaN    NaN        32        14  2019-08-06 23:00:00          NaN                  NaN          NaN
3  2019-08-06 21:00:00   1.0   blue      5184      3540  2019-08-06 22:00:00          3.0  2019-08-06 21:00:00       5184.0
4  2019-08-07 09:00:00   NaN    NaN       600       528  2019-08-08 09:00:00          NaN                  NaN          NaN
5  2019-08-07 16:00:00   NaN    NaN       322       300  2019-08-09 16:00:00          NaN                  NaN          NaN
6  2019-08-08 17:00:00   0.0   blue        72        12  2019-08-08 23:00:00          NaN                  NaN          NaN
7  2019-08-09 16:00:00   0.0    red      6000      4009  2019-08-11 16:00:00          NaN                  NaN          NaN

Upvotes: 3

Related Questions