Reputation: 1430
I have a dataframe like this:
| Date    | Device | Status |
|---------|--------|--------|
| 1990/01 | 50     | ON     |
| 1990/01 | 20     | ON     |
| 1990/03 | 25     | ON     |
| 1990/05 | 50     | OFF    |
| 2000/01 | 20     | OFF    |
Basically, I log a change of a device's status when it occurs, but this is not periodic, so the 'Date' column is not 'continuous'. I want to query a device's status at any date with some simple logic. E.g.:
1. df(device=50, date=1990/01) -> ON (easy)
2. df(device=50, date=1990/02) -> ON (there is no 1990/02 entry for device 50, but once a device is set to a specific status, it stays in it until there is a change)
3. df(device=50, date=1990/05) -> OFF
4. df(device=50, date=2000/09) -> OFF
5. df(device=50, date=1900/01) -> OFF (this is a tricky one: before its first ON, a device is in OFF status)

The second case is with different statuses, but to keep the question concise I'll keep it the same. The behavior is exactly the same except for point 5, where we assume that a device before its first logged ON status is also in ON status. There could be more statuses, but the logic to decide the device's status is exactly the same.
How can I do this in a pandastic way?
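For reference, a minimal sketch reproducing the sample data above (column names and values taken from the table):

```python
import pandas as pd

# Sample data, as in the table above
df = pd.DataFrame({
    'Date':   ['1990/01', '1990/01', '1990/03', '1990/05', '2000/01'],
    'Device': [50, 20, 25, 50, 20],
    'Status': ['ON', 'ON', 'ON', 'OFF', 'OFF'],
})
print(df)
```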
Upvotes: 1
Views: 205
Reputation: 2434
Say that your original dataframe is called status_df and that you have converted the Date column to datetime.date type. Then you can define a function and call it with your status_df as the first parameter, plus the desired device and check_date:
def get_status(df, device, check_date):
    # Filter only those entries for this device and with date <= check_date
    filtered_df = df[(df['Device']==device) & (df['Date']<=check_date)]
    # Now sort by date (don't use inplace=True here: filtered_df is a slice,
    # so sorting it in place raises a SettingWithCopyWarning)
    filtered_df = filtered_df.sort_values('Date')
    # If it is empty, choose a default value (say OFF)
    if len(filtered_df) < 1:
        return 'OFF'
    # The last entry of filtered_df contains the active status at check_date
    return filtered_df.iloc[-1]['Status']
UPDATE: in case you don't want to create an intermediate dataframe, you can do it in one expression with a simple try statement:
try:
    return df[(df['Device']==device) & (df['Date']<=check_date)].sort_values('Date').iloc[-1]['Status']
except IndexError:
    return 'OFF'
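Here is how the function above behaves on the sample data (a sketch; the DataFrame construction and the Timestamp query dates are assumptions):

```python
import pandas as pd

def get_status(df, device, check_date):
    # Keep only entries for this device at or before check_date,
    # then take the most recent one; default to OFF if there is none
    filtered_df = df[(df['Device'] == device) & (df['Date'] <= check_date)]
    filtered_df = filtered_df.sort_values('Date')
    if len(filtered_df) < 1:
        return 'OFF'
    return filtered_df.iloc[-1]['Status']

df = pd.DataFrame({
    'Date':   pd.to_datetime(['1990/01', '1990/01', '1990/03', '1990/05', '2000/01']),
    'Device': [50, 20, 25, 50, 20],
    'Status': ['ON', 'ON', 'ON', 'OFF', 'OFF'],
})

print(get_status(df, 50, pd.Timestamp('1990-02-01')))  # ON
print(get_status(df, 50, pd.Timestamp('2000-09-01')))  # OFF
print(get_status(df, 50, pd.Timestamp('1900-01-01')))  # OFF (no log yet)
```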
Upvotes: 1
Reputation: 863481
In my opinion, you need to define all possible dates for selection in a date_range, which is used to reindex the pivoted DataFrame. The NaNs are then replaced by forward filling, and all leading NaN values (before the first logged status) are set to OFF by fillna:
print (df)
Date Device Status
0 1990/01 50 ON
1 1990/01 20 ON
2 1990/03 25 ON
3 1990/05 50 OFF
4 1990/05 20 OFF <-changed for smaller output df
df['Date'] = pd.to_datetime(df['Date'])
rng = pd.date_range('1989-10-01', '1991-01-01', freq='MS')
df = df.pivot(index='Date', columns='Device', values='Status').reindex(rng).ffill().fillna('OFF')
print (df)
Device 20 25 50
1989-10-01 OFF OFF OFF
1989-11-01 OFF OFF OFF
1989-12-01 OFF OFF OFF
1990-01-01 ON OFF ON
1990-02-01 ON OFF ON
1990-03-01 ON ON ON
1990-04-01 ON ON ON
1990-05-01 OFF ON OFF
1990-06-01 OFF ON OFF
1990-07-01 OFF ON OFF
1990-08-01 OFF ON OFF
1990-09-01 OFF ON OFF
1990-10-01 OFF ON OFF
1990-11-01 OFF ON OFF
1990-12-01 OFF ON OFF
1991-01-01 OFF ON OFF
Last, if you need the original format of the dates, add strftime:
df.index = df.index.strftime('%Y/%m')
print (df)
Device 20 25 50
1989/10 OFF OFF OFF
1989/11 OFF OFF OFF
1989/12 OFF OFF OFF
1990/01 ON OFF ON
1990/02 ON OFF ON
1990/03 ON ON ON
1990/04 ON ON ON
1990/05 OFF ON OFF
1990/06 OFF ON OFF
1990/07 OFF ON OFF
1990/08 OFF ON OFF
1990/09 OFF ON OFF
1990/10 OFF ON OFF
1990/11 OFF ON OFF
1990/12 OFF ON OFF
1991/01 OFF ON OFF
EDIT:
More general solution:
def get_status(df, device, check_date):
    check_date = pd.to_datetime(check_date)
    df = df.copy()  # avoid mutating the caller's dataframe
    df['Date'] = pd.to_datetime(df['Date'])
    rng = pd.date_range(df['Date'].min(), df['Date'].max(), freq='MS')
    df = df.pivot(index='Date', columns='Device', values='Status').reindex(rng).ffill().fillna('OFF')
    #print (df)
    if check_date < df.index.min():
        return 'OFF'
    elif check_date > df.index.max():
        return df.loc[df.index[-1], device]
    else:
        return df.loc[check_date, device]
print (get_status(df, 50, '1990/01'))
#ON
print (get_status(df, 50, '1990/02'))
#ON
print (get_status(df, 50, '1990/05'))
#OFF
print (get_status(df, 50, '1990/09'))
#OFF
print (get_status(df, 50, '1900/01'))
#OFF
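As a side note, this kind of last-known-value lookup is also what pd.merge_asof does; a sketch, where the query frame and the OFF default for dates before the first log are assumptions:

```python
import pandas as pd

# Status log, sorted by date (merge_asof requires both frames sorted on the key)
status_df = pd.DataFrame({
    'Date':   pd.to_datetime(['1990/01', '1990/01', '1990/03', '1990/05', '2000/01']),
    'Device': [50, 20, 25, 50, 20],
    'Status': ['ON', 'ON', 'ON', 'OFF', 'OFF'],
}).sort_values('Date')

# Dates/devices we want to query, also sorted by date
queries = pd.DataFrame({
    'Date':   pd.to_datetime(['1900/01', '1990/02']),
    'Device': [50, 50],
}).sort_values('Date')

# For each query row, take the most recent status row for the same device
out = pd.merge_asof(queries, status_df, on='Date', by='Device')
# Before the first logged status there is no match -> NaN -> default to OFF
out['Status'] = out['Status'].fillna('OFF')
print(out)
```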
Upvotes: 1