time series automate labeling based on value per day

Question

I would like to turn point reference data into a form suitable for time-series analysis. My data has start_date and end_date columns, which I used to calculate the duration of each event (if start_date and end_date fall on the same day, the duration is 1; duration values range from 0 to 326).

Furthermore, I have another pandas DataFrame, which is a flattened series of images of size x*y. It has (among others) three columns: time, lon (x), lat (y).

Now I would like to add a boolean label to the latter dataset: True if the row's date falls within the event (from start_date through start_date + duration) at that location, otherwise False.

Below is some example data together with the required output, to show what I would like to achieve. Does anyone know how to approach this?

pd.DataFrame({'lat':[50,60], 
          'long':[-120,-110], 
          'time':[np.datetime64('2020-01-01'),np.datetime64('2020-05-06')],
          'end_time':[np.datetime64('2020-01-04'),np.datetime64('2020-05-06')],
          'duration':[4,1]})


    lat long    time        end_time    duration
0   50  -120    2020-01-01  2020-01-04  4
1   60  -110    2020-05-06  2020-05-06  1

Required output would look like:

    lat lon     time        label
0   50  -120    2020-01-01  True
1   50  -120    2020-01-02  True
2   50  -120    2020-01-03  True
3   50  -120    2020-01-04  True
4   50  -120    2020-01-05  False
...
x-1 60  -110    2020-05-05  False
x   60  -110    2020-05-06  True
x+1 60  -110    2020-05-07  False

Edit: Below is an example of how my data is really structured:

df = pd.DataFrame({'time': [np.datetime64('2002-05-18'),np.datetime64('2002-05-18'),np.datetime64('2002-05-18'),np.datetime64('2002-05-18'),np.datetime64('2002-05-18')],
                   'lon': [-116.125, -115.875, -115.625, -115.375, -115.125],
                   'lat': [55.125, 55.125, 55.125, 55.125, 55.125],
                   'out_date': [np.datetime64('2002-05-19'), None, None, None, None],
                   'duration_days': [2, None, None, None, None],
                   'Fire': [True, None, None, None, None],
                   'size_ha': [0.1, None, None, None, None],
                   'fire_count': [1, None, None, None, None]})

    time        lon         lat     out_date    duration_days   Fire    size_ha fire_count
0   2002-05-18  -116.125    55.125  2002-05-19  2.0             True    0.1     1.0
1   2002-05-18  -115.875    55.125  NaT         NaN             None    NaN     NaN
2   2002-05-18  -115.625    55.125  NaT         NaN             None    NaN     NaN
3   2002-05-18  -115.375    55.125  NaT         NaN             None    NaN     NaN
4   2002-05-18  -115.125    55.125  NaT         NaN             None    NaN     NaN

Each day has 42x46 combinations of lon and lat, and all of them are listed before the next day starts. In the newly added table, a forest fire occurred at 'time': 2002-05-18 at the given lon/lat coordinates, with a duration of 2 days. If I go 1932 (42x46) rows down to 'time': 2002-05-19, I would like the column 'Fire' (which is the label) to be updated to True for that same cell. If we grouped by lon and lat, the data would look like the 'required output' example above.

jezrael · Accepted Answer

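First convert both columns to datetimes and add one day to end_time so that the end date itself is included. Then repeat each row with Index.repeat, using the number of days between end_time and time as the repeat count, and shift each repeated row by a per-row counter from GroupBy.cumcount, converted with to_timedelta: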

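import numpy as np
import pandas as pd

# df is the first example DataFrame from the question (lat, long, time, end_time, duration)
df['time'] = pd.to_datetime(df['time'])
df['end_time'] = pd.to_datetime(df['end_time']) + pd.Timedelta(1, unit='d')

# number of days each event covers, end date included
s = df.pop('end_time').sub(df['time']).dt.days
# repeat each row once per day of its event
df = df.loc[df.index.repeat(s)].copy()
# 0, 1, 2, ... within each original row
counter = df.groupby(level=0).cumcount()
df['time'] = df['time'].add(pd.to_timedelta(counter, unit='d'))
df = df.reset_index(drop=True)
print(df)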
   lat  long       time  duration
0   50  -120 2020-01-01         4
1   50  -120 2020-01-02         4
2   50  -120 2020-01-03         4
3   50  -120 2020-01-04         4
4   60  -110 2020-05-06         1
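
If the repeat/cumcount combination is hard to follow, the same expansion can also be written with pd.date_range and DataFrame.explode. This is only a sketch of an alternative, not part of the solution above; the names events, days and expanded are made up for the illustration, and it assumes end_time is never earlier than time (an empty range would turn into a single missing value after explode).

```python
import pandas as pd

# Same example events as in the question (illustrative copy).
events = pd.DataFrame({
    'lat': [50, 60],
    'long': [-120, -110],
    'time': pd.to_datetime(['2020-01-01', '2020-05-06']),
    'end_time': pd.to_datetime(['2020-01-04', '2020-05-06']),
})

# One list of daily timestamps per event; date_range includes both endpoints.
days = pd.Series(
    [list(pd.date_range(start, end, freq='D'))
     for start, end in zip(events['time'], events['end_time'])],
    index=events.index,
    dtype=object,
)

# Replace the start date by the list of days, then create one row per day.
expanded = (events.drop(columns='end_time')
                  .assign(time=days)
                  .explode('time')
                  .reset_index(drop=True))

# explode leaves an object column, so convert it back to datetimes.
expanded['time'] = pd.to_datetime(expanded['time'])
print(expanded)
```

This is usually slower than Index.repeat on large frames because it builds one Python list per event, but it avoids the manual "+ 1 day" adjustment.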

Then merge the other DataFrame with this one using a left join and the indicator parameter, and finally test whether the indicator column equals 'both':

df1 = pd.DataFrame({'time': [np.datetime64('2020-01-03'),np.datetime64('2002-05-18'),
                             np.datetime64('2002-05-18'),np.datetime64('2002-05-18'),
                             np.datetime64('2002-05-18')],
                   'lon': [-120, -115.875, -115.625, -115.375, -115.125],
                   'lat': [50, 55.125, 55.125, 55.125, 55.125],
                   'out_date': [np.datetime64('2002-05-19'), None, None, None, None],
                   'duration_days': [2, None, None, None, None],
                   'Fire': [True, None, None, None, None],
                   'size_ha': [0.1, None, None, None, None],
                   'fire_count': [1, None, None, None, None]})

df = (df1.rename(columns={'lon':'long'})
         .merge(df, 
                on=['lat','long','time'], 
                indicator='label', 
                how='left'))
df['label'] = df['label'].eq('both')
print (df)
        time     long     lat   out_date  duration_days  Fire  size_ha  \
0 2020-01-03 -120.000  50.000 2002-05-19            2.0  True      0.1   
1 2002-05-18 -115.875  55.125        NaT            NaN  None      NaN   
2 2002-05-18 -115.625  55.125        NaT            NaN  None      NaN   
3 2002-05-18 -115.375  55.125        NaT            NaN  None      NaN   
4 2002-05-18 -115.125  55.125        NaT            NaN  None      NaN   

   fire_count  duration  label  
0         1.0       4.0   True  
1         NaN       NaN  False  
2         NaN       NaN  False  
3         NaN       NaN  False  
4         NaN       NaN  False
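
For the structure from the edit in the question, where the fire information sits in the grid itself, the same idea can be applied to a single DataFrame. The following is a rough sketch under several assumptions: the grid is a tiny made-up one (2 cells x 3 days instead of 42x46), a fire start row is recognised by a non-missing out_date, and lon/lat values match exactly between rows (they are floats, so rounded or recomputed coordinates would not match). The names grid, fires, burning and labelled are illustrative.

```python
import pandas as pd

# Hypothetical miniature grid: every cell is listed for each day in turn.
days = pd.date_range('2002-05-18', periods=3, freq='D')
cells = [(-116.125, 55.125), (-115.875, 55.125)]
grid = pd.DataFrame([(d, lon, lat) for d in days for lon, lat in cells],
                    columns=['time', 'lon', 'lat'])

# One fire starts in the first cell on 2002-05-18 and is out on 2002-05-19.
out_dates = [pd.NaT] * len(grid)
out_dates[0] = pd.Timestamp('2002-05-19')
grid['out_date'] = pd.to_datetime(out_dates)

# Rows that mark the start of a fire carry an out_date.
fires = grid.loc[grid['out_date'].notna(), ['time', 'lon', 'lat', 'out_date']]

# Expand every fire to one row per burning day (start and out_date inclusive).
n_days = (fires['out_date'] - fires['time']).dt.days + 1
burning = fires.loc[fires.index.repeat(n_days), ['time', 'lon', 'lat']].copy()
offsets = pd.to_timedelta(burning.groupby(level=0).cumcount(), unit='D')
burning['time'] = burning['time'] + offsets
burning = burning.drop_duplicates()  # overlapping fires in one cell count once

# Left join keeps every grid row; the indicator tells whether a burning day matched.
labelled = grid.merge(burning, on=['time', 'lon', 'lat'],
                      how='left', indicator='label')
labelled['label'] = labelled['label'].eq('both')
print(labelled)
```

Here the first cell is labelled True on 2002-05-18 and 2002-05-19 and False on 2002-05-20, while the second cell stays False throughout.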

EDIT: If you get this error:

ValueError: repeats may not contain negative values

it means that in at least one row end_time is earlier than time:

df = pd.DataFrame({'lat':[50,60], 
          'long':[-120,-110], 
          'time':[np.datetime64('2020-01-01'),np.datetime64('2020-05-06')],
          'end_time':[np.datetime64('2020-01-04'),np.datetime64('2020-05-04')],
          'duration':[4,1]})

df['time'] = pd.to_datetime(df['time'])
df['end_time'] = pd.to_datetime(df['end_time']) 
print (df)
   lat  long       time   end_time  duration
0   50  -120 2020-01-01 2020-01-04         4
1   60  -110 2020-05-06 2020-05-04         1  <- end_time < time
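
To find which rows of your own data trigger the error, you can compute the repeat counts first and filter on them. A minimal check, assuming the same column names as in the example; n_days and bad are just illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    'lat': [50, 60],
    'long': [-120, -110],
    'time': pd.to_datetime(['2020-01-01', '2020-05-06']),
    'end_time': pd.to_datetime(['2020-01-04', '2020-05-04']),
})

# Inclusive number of days per row, i.e. the value later passed to Index.repeat.
n_days = (df['end_time'] - df['time']).dt.days + 1

# A count below 1 means end_time lies before time.
bad = df[n_days < 1]
print(bad)
```

Looking at these rows before changing them helps to decide whether the dates were really swapped or whether the records are wrong in some other way.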

A possible solution is to swap these values:

m = df['time'].gt(df['end_time'])
d = {'end_time':'time','time':'end_time'}
df.loc[m, ['time','end_time']] = df.loc[m, ['end_time','time']].rename(columns=d)

df['end_time'] = df['end_time']  + pd.Timedelta(1, unit='d')
print (df)
   lat  long       time   end_time  duration
0   50  -120 2020-01-01 2020-01-05         4
1   60  -110 2020-05-04 2020-05-07         1

And then use the solution from above:

s = df.pop('end_time').sub(df['time']).dt.days
df = df.loc[df.index.repeat(s)].copy()
counter = df.groupby(level=0).cumcount()
df['time'] = df['time'].add(pd.to_timedelta(counter, unit='d'))
df = df.reset_index(drop=True)
print (df)
   lat  long       time  duration
0   50  -120 2020-01-01         4
1   50  -120 2020-01-02         4
2   50  -120 2020-01-03         4
3   50  -120 2020-01-04         4
4   60  -110 2020-05-04         1
5   60  -110 2020-05-05         1
6   60  -110 2020-05-06         1
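
As a side note, the swap can also be written with Series.where, which some people find easier to read than the rename trick. This is just a sketch of an equivalent alternative with the same example data; ordered, start and end are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    'lat': [50, 60],
    'long': [-120, -110],
    'time': pd.to_datetime(['2020-01-01', '2020-05-06']),
    'end_time': pd.to_datetime(['2020-01-04', '2020-05-04']),
})

# True where the two dates are already in the right order.
ordered = df['time'] <= df['end_time']

# Keep the value where the row is ordered, otherwise take it from the other column.
start = df['time'].where(ordered, df['end_time'])
end = df['end_time'].where(ordered, df['time'])

df = df.assign(time=start, end_time=end)
print(df)
```

After this, time is never later than end_time, so adding one day to end_time and repeating the rows works as before.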
