Quant Christo
Quant Christo

Reputation: 1430

Smart indexing by time in pandas

I have dataframe like that:

| Date    | Device | Status |
| 1990/01 | 50     | ON     |
| 1990/01 | 20     | ON     |
| 1990/03 | 25     | ON     |
| 1990/05 | 50     | OFF    |
| 2000/01 | 20     | OFF    |

Basically, I log change of devices status when it occurs, but this is not periodic so 'Date' column is not 'continous'. I want to query device status at any date with some simple logic E.g.

  1. df(device=50, date=1990/01) -> ON - easy
  2. df(device=50, date=1990/02) -> ON , there is no 1990/02 for device 50, but when device is set to specific Status, then it stays in it unless there is no change
  3. df(device=50, date=1990/05) -> OFF
  4. df(device=50, date=2000/09) -> OFF
  5. df(device=50, date=1900/01) -> OFF, this is a tricky one, before first ON, device is in OFF status

The second case is with different statuses, but to make question concise I'll keep it the same. Behavior is exactly the same except point 5. when we assume that device before first logged ON status is also in ON status. There could be more statuses but logic to decide device status is exactly the same.

How to do it in pandastic way?

Upvotes: 1

Views: 205

Answers (2)

kosnik
kosnik

Reputation: 2434

Say that your original dataframe is called status_df and that you have converted the Date column to datetime.date type. Then you can define a function and call it with your status_df as the first parameter and the desired device and check_date

def get_status(df, device, check_date):
    # Filter only those entries for this device and with date <= check_date
    filtered_df = df[(df['Device']==device) & (df['Date']<=check_date)]
    # Now sort by date
    filtered_df.sort_values('Date', inplace=True)
    # Now the last entry of the filtered_df contains the active status at check_date
    # If it is empty choose a default values (say OFF)
    if len(filtered_df) < 1:
         return 'OFF'
    return filtered_df.iloc[-1]['Status']

UPDATE In case you don't want to create a new dataframe, you can do it with a simple try statement

try:
    return df[(df['Device']==device) & (df['Date']<=check_date)].sort_values('Date').iloc[-1]['Status']
except IndexError:
    return 'OFF'

Upvotes: 1

jezrael
jezrael

Reputation: 863481

In my opinion there has to be define all possible dates for select in date_range which is used for reindex of pivoted DataFrame. NaNs are replaced first by forward filling and all first NaN values to OFF by fillna:

print (df)
      Date  Device Status
0  1990/01      50     ON
1  1990/01      20     ON
2  1990/03      25     ON
3  1990/05      50    OFF
4  1990/05      20    OFF <-changed for smaller output df

df['Date'] = pd.to_datetime(df['Date'])

rng = pd.date_range('1989-10-01', '1991-01-01', freq='MS')
df = df.pivot('Date','Device','Status').reindex(rng).ffill().fillna('OFF')
print (df)
Device       20   25   50
1989-10-01  OFF  OFF  OFF
1989-11-01  OFF  OFF  OFF
1989-12-01  OFF  OFF  OFF
1990-01-01   ON  OFF   ON
1990-02-01   ON  OFF   ON
1990-03-01   ON   ON   ON
1990-04-01   ON   ON   ON
1990-05-01  OFF   ON  OFF
1990-06-01  OFF   ON  OFF
1990-07-01  OFF   ON  OFF
1990-08-01  OFF   ON  OFF
1990-09-01  OFF   ON  OFF
1990-10-01  OFF   ON  OFF
1990-11-01  OFF   ON  OFF
1990-12-01  OFF   ON  OFF
1991-01-01  OFF   ON  OFF

Last if need original format of dates add strftime:

df.index = df.index.strftime('%Y/%m')
print (df)
Device    20   25   50
1989/10  OFF  OFF  OFF
1989/11  OFF  OFF  OFF
1989/12  OFF  OFF  OFF
1990/01   ON  OFF   ON
1990/02   ON  OFF   ON
1990/03   ON   ON   ON
1990/04   ON   ON   ON
1990/05  OFF   ON  OFF
1990/06  OFF   ON  OFF
1990/07  OFF   ON  OFF
1990/08  OFF   ON  OFF
1990/09  OFF   ON  OFF
1990/10  OFF   ON  OFF
1990/11  OFF   ON  OFF
1990/12  OFF   ON  OFF
1991/01  OFF   ON  OFF

EDIT:

More general solution:

def get_status(df, device, check_date):
    check_date = pd.to_datetime(check_date)
    df['Date'] = pd.to_datetime(df['Date'])

    rng = pd.date_range(df['Date'].min(), df['Date'].max(), freq='MS')
    df = df.pivot('Date','Device','Status').reindex(rng).ffill().fillna('OFF')
    #print (df)

    if check_date < df.index.min():
        return 'OFF'
    elif check_date > df.index.max():
        return df.loc[df.index[-1], device]
    else:
        return df.loc[check_date, device]

print (get_status(df, 50, '1990/01'))
#ON
print (get_status(df, 50, '1990/02'))
#ON
print (get_status(df, 50, '1990/05'))
#OFF
print (get_status(df, 50, '1990/09'))
#OFF
print (get_status(df, 50, '1900/01'))
#OFF

Upvotes: 1

Related Questions