Fr3nch
Fr3nch

Reputation: 9

Efficiently generating labels for timeseries data based on future timeseries data

I have two years worth of sensor data in a pandas dataframe. The index is a time series. Looks like this:


                     temp1     overtemp  time_to_overtemp
datetime                                                    
2019-01-02 09:31:00  305.96
2019-01-02 09:32:00  305.98
2019-01-02 09:33:00  305.70
2019-01-02 09:34:00  305.30
2019-01-02 09:35:00  306.88

What I want to do is loop over the time series to populate the "overtemp" and "time_to_overtemp columns. "overtemp" needs to be assigned a 1 if the temperature data anytime in the next two weeks increases by more than 2%. "time_to_overtemp" needs to show the time of the next >2% reading, if it exists in the next two weeks. If the temperature says within 2% for the next two weeks, both columns should be assigned a 0.

For example 2019-01-02 09:31:00 should look at the next two weeks worth of temperature data and put a 0 in both columns because all data in that time period is within 2% of the value. The overtemp value for 2020-01-02 09:35:00 should be a 1 because the value increased by 5% a week later. The time_to_overtemp value should indicate 7 days, 2 hours, 38 minutes because thats when the overtemp occured.

I am successfully doing some more math stuff using iterrows:

for datetime, row in df.iterrows():

but its taking forever. And I am not getting how to do the time iterations and calculations at all yet.

I have done other labeling with:

df['overtemp'] = np.select([df['temp1']<305, df['temp1']>305], [1,0])

I guess this vectorizes the process? It sure works a lot faster than iterating. But I can't figure out how to implement the datetime+two week portion.

Upvotes: 1

Views: 150

Answers (2)

Ezer K
Ezer K

Reputation: 3739

Think you could do it with pandas rolling by applying funcs, here is my go, hope this is what you need:

# create random data
ind = pd.date_range(start = pd.to_datetime("2019-01-02 09:31:00"), periods=28*24*60, freq='min')
v = [random.randint(30000, 32000)/100 for x in ind]
df = pd.DataFrame(v, index=ind, columns = ['temp1'])

# define funcs
def overtemp(S):
    l = S[S>=S[-1]*1.02]
    if len(l)>0:
        return l[-1]
    else:
        return 0

def overtemp_seconds(S):
    l = np.argwhere(S.values>=S.values[-1]*1.02)
    if len(l)>0:
        i = l[-1][0]
        delta = S.index[i] - S.index[-1]
        return delta.seconds
    else:
        return 0

# apply funcs to time window (reversed cause rolling gives you time window till offset)
over_temp = df[::-1].rolling('14D').apply(overtemp)[::-1]['temp1']
seconds_to_over_temp = df[::-1].rolling('14D').apply(overtemp_seconds)[::-1]['temp1']

# add to orig df
df['over_temp'] = over_temp
df['seconds_to_over_temp'] = seconds_to_over_temp

Upvotes: 0

caiolopes
caiolopes

Reputation: 571

Here is a suggestion on how to approach your problem by using apply and indexing.

But it is not a very fast solution indeed, maybe pandas have a better function to do this.

Or you can parallelize the apply function to make it faster.

df = pd.DataFrame(pd.date_range(start='2020-01-01', end='2020-03-01', freq='H'), columns=['datetime'])
df['temp'] =  [np.random.uniform(low=300, high=310) for _ in range(df.shape[0])]

def get_overtemp(row):
    now = row.datetime
    next_14_days = now + timedelta(days=14)
    temp = row.temp
    
    filtered = df[
        (df['datetime'] > now) & 
        (df['datetime'] <= next_14_days) & 
        (df['temp'] > temp * 1.02)
    ]
    
    overtemp = len(filtered) > 0
    time_to_overtemp = None
    if overtemp:
        time_to_overtemp = filtered['datetime'].values[0] - now
        
    return pd.Series([overtemp, time_to_overtemp])

df[['overtemp', 'time_to_overtemp']] = df.apply(get_overtemp, axis=1)
df.head(20)

Result:

|    | datetime            |    temp | overtemp   | time_to_overtemp   |
|---:|:--------------------|--------:|:-----------|:-------------------|
|  0 | 2020-01-01 00:00:00 | 309.502 | False      | NaT                |
|  1 | 2020-01-01 01:00:00 | 303.816 | True       | 7 days 00:00:00    |
|  2 | 2020-01-01 02:00:00 | 307.297 | False      | NaT                |
|  3 | 2020-01-01 03:00:00 | 306.252 | False      | NaT                |
|  4 | 2020-01-01 04:00:00 | 303.458 | True       | 0 days 07:00:00    |
|  5 | 2020-01-01 05:00:00 | 304.27  | False      | NaT                |
|  6 | 2020-01-01 06:00:00 | 300.98  | True       | 0 days 05:00:00    |
|  7 | 2020-01-01 07:00:00 | 306.652 | False      | NaT                |
|  8 | 2020-01-01 08:00:00 | 304.107 | False      | NaT                |
|  9 | 2020-01-01 09:00:00 | 300.077 | True       | 0 days 02:00:00    |

See: https://github.com/nalepae/pandarallel

Upvotes: 1

Related Questions