Justin

Reputation: 327

Identifying Consecutive Sequences of Data and Counting their Lengths

I am working with a DataFrame where each row observation has an ordinal datetime object attached to it. I've written a function that, I believe, looks through my DataFrame, identifies consecutively occurring days, and determines the length of each run of consecutive days, using the following code:

    def consecutiveCount(df):
        df = df.copy()
        cond1 = df['DATE_INT'].shift(-1) - df['DATE_INT'] == 1
        cond2 = df['DATE_INT'].shift(1) - df['DATE_INT'] == -1

        cond3 = df['DATE_INT'].shift(-2) - df['DATE_INT'] == 2
        cond4 = df['DATE_INT'].shift(2) - df['DATE_INT'] == -2

Now I continue making these conditions in the same fashion until I reach:

        cond55 = df['DATE_INT'].shift(-28) - df['DATE_INT'] == 28
        cond56 = df['DATE_INT'].shift(28) - df['DATE_INT'] == -28

        cond57 = df['DATE_INT'].shift(-29) - df['DATE_INT'] == 29
        cond58 = df['DATE_INT'].shift(29) - df['DATE_INT'] == -29

I then write the length of the 'run' of days into a column with the following code:

        df.loc[cond1 | cond2, 'CONSECUTIVE_COUNT'] = 2
        df.loc[cond3 | cond4, 'CONSECUTIVE_COUNT'] = 3

Again, I continue until I reach 'runs' of days of length 30.

        df.loc[cond55 | cond56, 'CONSECUTIVE_COUNT'] = 29
        df.loc[cond57 | cond58, 'CONSECUTIVE_COUNT'] = 30
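
Since every condition and assignment above follows the same pattern, the whole function can be restated as a loop. This is just a compact equivalent of the code above, not a change in logic (it reproduces the same behavior, including the counting flaw noted in the EDIT below):

    def consecutiveCount(df):
        df = df.copy()
        # For each offset n, flag rows that have a neighbour n days away in
        # either direction, then record run length n + 1 (assignments for
        # larger n overwrite those for smaller n, as in the code above).
        for n in range(1, 30):
            fwd = df['DATE_INT'].shift(-n) - df['DATE_INT'] == n
            bwd = df['DATE_INT'].shift(n) - df['DATE_INT'] == -n
            df.loc[fwd | bwd, 'CONSECUTIVE_COUNT'] = n + 1
        return df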

Finally I apply the function to particular groups of my DataFrame as follows:

    df1 = df.groupby(['COUNTY_GEOID_YEAR','TEMPBIN']).apply(consecutiveCount)

I am sure there are much more efficient ways to write this code. By printing various strings throughout my script, I've identified that the bottleneck is the step where the function is applied.

Any help on writing the function more efficiently, or on speeding up applying it, would be great! Please let me know if I can provide any more info.

EDIT: As @DSM pointed out, my code does not count the lengths of consecutive runs of days correctly. His solution did so accurately for me!

Upvotes: 1

Views: 215

Answers (1)

DSM

Reputation: 353419

IIUC, you can use the shift-compare-cumsum pattern after applying your groupby, and then do a transform.

Assuming that your data looks something like this (simplifying a bit)

    df = pd.DataFrame({"GEOID_YEAR": [2000]*10 + [2001]*4, "TEMPBIN": [1]*14,
                       "DATE_INT": [1,2,3,4,6,7,9,10,11,14] + list(range(14,18)),
                       "OTHER_COL": [2]*14})

or

        DATE_INT  GEOID_YEAR  OTHER_COL  TEMPBIN
    0          1        2000          2        1
    1          2        2000          2        1
    2          3        2000          2        1
    3          4        2000          2        1
    4          6        2000          2        1
    5          7        2000          2        1
    6          9        2000          2        1
    7         10        2000          2        1
    8         11        2000          2        1
    9         14        2000          2        1
    10        14        2001          2        1
    11        15        2001          2        1
    12        16        2001          2        1
    13        17        2001          2        1

then

df["cons_id"] = df.groupby(["GEOID_YEAR", "TEMPBIN"])["DATE_INT"].apply(
    lambda x: (x != x.shift() + 1).cumsum())
df["cons_count"] = (df.groupby(["GEOID_YEAR", "TEMPBIN", "cons_id"])
                    ["cons_id"].transform("count"))

gives us

    In [78]: df
    Out[78]:
        DATE_INT  GEOID_YEAR  OTHER_COL  TEMPBIN  cons_id  cons_count
    0          1        2000          2        1        1           4
    1          2        2000          2        1        1           4
    2          3        2000          2        1        1           4
    3          4        2000          2        1        1           4
    4          6        2000          2        1        2           2
    5          7        2000          2        1        2           2
    6          9        2000          2        1        3           3
    7         10        2000          2        1        3           3
    8         11        2000          2        1        3           3
    9         14        2000          2        1        4           1
    10        14        2001          2        1        1           4
    11        15        2001          2        1        1           4
    12        16        2001          2        1        1           4
    13        17        2001          2        1        1           4
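
The cons_id step works because, within each group, x.shift() + 1 is the value each row would have if it continued the previous row's run; (x != x.shift() + 1) is therefore True exactly at the first row of each run, and cumsum turns those run starts into increasing run labels. A minimal standalone illustration, using the DATE_INT values from the 2000 group above:

    import pandas as pd

    s = pd.Series([1, 2, 3, 4, 6, 7, 9, 10, 11, 14])
    starts = s != s.shift() + 1       # True at the first element of each run
    print(starts.cumsum().tolist())   # [1, 1, 1, 1, 2, 2, 3, 3, 3, 4]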

Upvotes: 1
