spokati
spokati

Reputation: 55

Counting consecutive days of temperature data

So I have some sea surface temperature anomaly data. These data have been filtered down so that these are the values that are below a certain threshold. However, I am trying to identify cold spells - that is, to isolate events that last longer than 5 consecutive days. A sample of my data is below (I've been working between xarray datasets/dataarrays and pandas dataframes). Note, the 'day' is the day number of the month I am looking at (eventually will be expanded to the whole year). I have been scouring SO/the internet for ways to extract these 5-day-or-longer events based on the 'day' column, but I haven't gotten anything to work. I'm still relatively new to coding so my first thought was looping over the rows of the 'day' column but I'm not sure. Any insight is appreciated.

Here's what some of my data look like as a pandas df:

        lat     lon     time        day  ssta
5940    24.125  262.375 1984-06-03  3   -1.233751
21072   24.125  262.375 1984-06-04  4   -1.394495
19752   24.125  262.375 1984-06-05  5   -1.379742
10223   24.125  262.375 1984-06-27  27  -1.276407
47355   24.125  262.375 1984-06-28  28  -1.840763
... ... ... ... ... ...
16738   30.875  278.875 2015-06-30  30  -1.345640
3739    30.875  278.875 2020-06-16  16  -1.212824
25335   30.875  278.875 2020-06-17  17  -1.446407
41891   30.875  278.875 2021-06-01  1   -1.714249
27740   30.875  278.875 2021-06-03  3   -1.477497

64228 rows × 5 columns

As a filtered xarray:

xarray.Dataset
Dimensions:  lat: 28, lon: 68, time: 1174
Coordinates:  
time (time)  datetime64[ns]   1982-06-01 ... 2021-06-04
lon (lon) float32 262.1 262.4 262.6 ... 278.6 278.9
lat (lat) float32 24.12 24.38 24.62 ... 30.62 30.88
day (time) int64 1 2 3 4 5 6 7 ... 28 29 30 1 2 3 4
Data variables:
ssta (time, lat, lon) float32 nan nan nan nan ... nan nan nan nan
Attributes: (0)

TLDR; I want to identify (and retain the information of) events that are 5+ consecutive days, ie if there were a day 3 through day 8, or day 21 through day 30, etc.

Upvotes: 1

Views: 1415

Answers (2)

sitting_duck
sitting_duck

Reputation: 3720

You can pull the day ranges of the spells using this approach:

min_spell_days = 6
days = {'day': [1,2,5,6,7,8,9,10,17,19,21,22,23,24,25,26,27,31]}
df = pd.DataFrame(days)

Find number of days between consecutive entries:

diff = df['day'].diff()

Mark the last day of a spell:

df['last'] = (diff == 1) & (diff.shift(-1) > 1)

Accumulate the number of days in each spell:

df['diff0'] = np.where(diff > 1, 0, diff)
df['cs'] = df['diff0'].eq(0).cumsum()
df['spell_days'] = df.groupby('cs')['diff0'].transform('cumsum')

Mark the last entry as the last day of a spell if applicable:

if diff.iat[-1] == 1:
    df['last'].iat[-1] = True

Select the last day of all qualifying spells:

df_spells = (df[df['last'] & (df['spell_days'] >= (min_spell_days-1))]).copy()

Identify the start, end and duration of each spell:

df_spells['end_day'] = df_spells['day']
df_spells['start_day'] = (df_spells['day'] - df['spell_days'])
df_spells['spell_days'] = df['spell_days'] + 1

Resulting df:

df_spells[['start_day','end_day','spell_days']].astype('int')

    start_day  end_day  spell_days
7           5       10           6
16         21       27           7

Also, using date arithmetic 'day' you could represent a serial day number relative to some base date - like 1/1/1900. That way spells that span month and year boundaries could be handled. It would then be trivial to convert back to a date using date arithmetic and that serial number.

Upvotes: 1

Lukas S
Lukas S

Reputation: 3603

I think rather than filtering your original data you should try to do it the pandas way which in this case means obtain a series with true false values depending on your condition.

Your data seems not to include temperatures so here is my example:

import pandas as pd
import numpy as np

df = pd.DataFrame(data={'temp':np.random.randint(10,high=40,size=64228,dtype='int64')})

Will generate a DataFrame with a single column containing random temperatures between 10 and 40 degrees. Notice that I can just work with the auto generated index but you might have to switch it to a column like time or date or something like that using .set_index. Say we are interested in the consecutive days with more than 30 degrees.

is_over_30 = df['temp'] > 30

will give us a True/False array with that information. Notice that this format is very useful since we can index with it. E.g. df[is_over_30] will give us the rows of the dataframe for days where the temperature is over 30 deg. Now we wanna shift the True/False values in is_over_30 one spot forward and generate a new series that is true if both are true like so

is_over_30 & np.roll(is_over_30, -1) 

Basically we are done here and could write 3 more of those & rolls. But there is a way to write it more concise.

from functools import reduce

is_consecutively_over_30 = reduce(lambda a,b: a&b, [np.roll(is_over_30, -i) for i in range(5)])

Keep in mind that that even though the last 4 days can't be consecutively over 30 deg this might still happen here since roll shifts the first values into the position relevant for that. But you can just set the last 4 values to False to resolve this.

is_consecutively_over_30[-4:] = False

Upvotes: 2

Related Questions