Reputation: 327
I am working with a DataFrame where each row observation has an ordinal datetime value attached to it. I've written a function that I believe looks through my DataFrame and identifies runs of consecutively occurring days, and the length of each run, with the following code:
def consecutiveCount(df):
    df = df.copy()
    cond1 = df['DATE_INT'].shift(-1) - df['DATE_INT'] == 1
    cond2 = df['DATE_INT'].shift(1) - df['DATE_INT'] == -1
    cond3 = df['DATE_INT'].shift(-2) - df['DATE_INT'] == 2
    cond4 = df['DATE_INT'].shift(2) - df['DATE_INT'] == -2
I continue making these conditions in the same fashion until I reach:
    cond55 = df['DATE_INT'].shift(-28) - df['DATE_INT'] == 28
    cond56 = df['DATE_INT'].shift(28) - df['DATE_INT'] == -28
    cond57 = df['DATE_INT'].shift(-29) - df['DATE_INT'] == 29
    cond58 = df['DATE_INT'].shift(29) - df['DATE_INT'] == -29
I then write the length of the 'run' of days in a column variable with the following code:
    df.loc[cond1 | cond2, 'CONSECUTIVE_COUNT'] = 2
    df.loc[cond3 | cond4, 'CONSECUTIVE_COUNT'] = 3
Again, I continue in the same fashion until I reach runs of length 30:
    df.loc[cond55 | cond56, 'CONSECUTIVE_COUNT'] = 29
    df.loc[cond57 | cond58, 'CONSECUTIVE_COUNT'] = 30
Finally I apply the function to particular groups of my DataFrame as follows:
df1 = df.groupby(['COUNTY_GEOID_YEAR','TEMPBIN']).apply(consecutiveCount)
I am sure there are much more efficient ways to write this code. By printing various strings throughout my script, I've identified that the bottleneck is applying the function.
Any help on writing the function more efficiently, or on speeding up applying it, would be great! Please let me know if I can provide any more info.
EDIT: As @DSM pointed out, my code does not correctly count the length of consecutive runs of days. His solution did so accurately for me!
Upvotes: 1
Views: 215
Reputation: 353419
IIUC, you can use the shift-compare-cumsum pattern within each group, and then do a transform.
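As a quick illustration of that pattern on a bare Series (the values here are made up): a new run starts wherever a value is not exactly one more than its predecessor, and the cumulative sum of those "start" flags labels each run.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 7, 8])  # ordinal day numbers
# True at the start of each run (the first element compares against NaN, so it's True too)
starts = s != s.shift() + 1
run_id = starts.cumsum()
print(run_id.tolist())  # [1, 1, 1, 2, 2]
```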
Assuming that your data looks something like this (simplifying a bit)
df = pd.DataFrame({"GEOID_YEAR": [2000]*10 + [2001]*4,
                   "TEMPBIN": [1]*14,
                   "DATE_INT": [1,2,3,4,6,7,9,10,11,14] + list(range(14,18)),
                   "OTHER_COL": [2]*14})
or
DATE_INT GEOID_YEAR OTHER_COL TEMPBIN
0 1 2000 2 1
1 2 2000 2 1
2 3 2000 2 1
3 4 2000 2 1
4 6 2000 2 1
5 7 2000 2 1
6 9 2000 2 1
7 10 2000 2 1
8 11 2000 2 1
9 14 2000 2 1
10 14 2001 2 1
11 15 2001 2 1
12 16 2001 2 1
13 17 2001 2 1
then
df["cons_id"] = df.groupby(["GEOID_YEAR", "TEMPBIN"])["DATE_INT"].apply(
    lambda x: (x != x.shift() + 1).cumsum())
df["cons_count"] = (df.groupby(["GEOID_YEAR", "TEMPBIN", "cons_id"])
                    ["cons_id"].transform("count"))
gives us
In [78]: df
Out[78]:
DATE_INT GEOID_YEAR OTHER_COL TEMPBIN cons_id cons_count
0 1 2000 2 1 1 4
1 2 2000 2 1 1 4
2 3 2000 2 1 1 4
3 4 2000 2 1 1 4
4 6 2000 2 1 2 2
5 7 2000 2 1 2 2
6 9 2000 2 1 3 3
7 10 2000 2 1 3 3
8 11 2000 2 1 3 3
9 14 2000 2 1 4 1
10 14 2001 2 1 1 4
11 15 2001 2 1 1 4
12 16 2001 2 1 1 4
13 17 2001 2 1 1 4
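Wrapped up for the question's original column names (COUNTY_GEOID_YEAR and TEMPBIN are taken from the question's groupby; DATE_INT is assumed to already be an ordinal day number), the same two steps might look like this. This sketch uses `transform` for the run labels as well, since it keeps the original index:

```python
import pandas as pd

def add_consecutive_count(df):
    """Label runs of consecutive DATE_INT values within each
    (COUNTY_GEOID_YEAR, TEMPBIN) group and record each run's length."""
    keys = ["COUNTY_GEOID_YEAR", "TEMPBIN"]
    df = df.copy()
    # new run wherever DATE_INT is not exactly 1 more than the previous row's
    df["cons_id"] = df.groupby(keys)["DATE_INT"].transform(
        lambda x: (x != x.shift() + 1).cumsum())
    # length of the run each row belongs to
    df["CONSECUTIVE_COUNT"] = (df.groupby(keys + ["cons_id"])["cons_id"]
                                 .transform("count"))
    return df
```

Because everything is vectorized per group, this avoids both the fifty-odd hand-written conditions and the per-group Python-level `apply` of the original function.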
Upvotes: 1