Reputation: 553
Based on this example data:
data = """value
"2020-03-02" 2
"2020-03-03" 4
"2020-03-01" 3
"2020-03-04" 0
"2020-03-08" 0
"2020-03-06" 0
"2020-03-07" 2"""
With value indexed by date (as a datetime index), I compute a new cumulated value column, value_cum. For each row i of value_cum (vc{i}, i from 0 to n), I search within the cut series of value_cum (vc'{j}, j from 0 to i) for the most recent row j which still verifies the ratio vc{i} / vc'{j} >= 2.
In the end, for each day, I get the delta between the current day and the day which satisfies the predicate. For this data, I get:
value value_cum computeValue delta
2020-03-01 3 3 NaN NaN
2020-03-02 2 5 NaN NaN
2020-03-03 4 9 3.0 2.0
2020-03-04 0 9 3.0 2.0
2020-03-06 0 9 3.0 2.0
2020-03-07 2 11 2.2 5.0
2020-03-08 0 11 2.2 5.0
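For reference, here is a minimal brute-force sketch of that definition (the doubling_deltas helper is my own illustration, not the code I'm asking about); on the data above it reproduces this exact table, including the duplicate-value handling described below:
import numpy as np
import pandas as pd

def doubling_deltas(df):
    # brute-force version of the predicate, for illustration only
    out = df.sort_index().copy()
    out["value_cum"] = out["value"].cumsum()
    first_seen = {}                      # cum value -> first day it was reached
    ratios, deltas = [], []
    for day, cum in out["value_cum"].items():
        first_seen.setdefault(cum, day)
        ref = first_seen[cum]            # reuse the first day with this cum value
        past = out.loc[out.index < ref, "value_cum"]
        hits = past[cum / past >= 2]     # all past days at most half of today
        if hits.empty:
            ratios.append(np.nan)
            deltas.append(np.nan)
        else:
            ratios.append(cum / hits.iloc[-1])         # most recent qualifying day
            deltas.append((ref - hits.index[-1]).days)
    return out.assign(computeValue=ratios, delta=deltas)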
Edit: more context information here.
This is actually code to find the first doubling rate (in days) of the COVID-19 accumulated death count: value is my deaths per day, and value_cum is the accumulated deaths by day. For each day, I search back through the existing series for when the cumulated death count was multiplied by 2. This is why I cut the series: to compute my ratio I only need the n previous dates/rows (past days) before the day I want to test.
I found this computation on the COVID-19 Our World in Data charts, but I want to compute this indicator for one country and for each day, not only for the last day as the picture shows :)
For example, for the date 2020-03-04, I only need to compute the ratio between 2020-03-04 and 2020-03-01/02/03 to find the FIRST date where the ratio is >= 2.
In this example, 2020-03-04 has no more deaths than 2020-03-03, so we don't want to compute a new delta (the number of days before deaths multiplied by >= 2 is the same as for 2020-03-03!). I explain this in Edits 1/2, archived at the end of this post.
We use a dictionary to store the first occurrence of each cumulated value, so when I see that value_cum is unchanged, I look in the dictionary to get the correct date for the ratio computation (9 returns 2020-03-03).
Here is my current working code to do that:
import pandas as pd
import numpy as np  # f() below uses np.NaN
import io
from dfply import *
data = """value
"2020-03-02" 2
"2020-03-03" 4
"2020-03-01" 3
"2020-03-04" 0
"2020-03-08" 0
"2020-03-06" 0
"2020-03-07" 2"""
df = pd.read_table(io.StringIO(data), delim_whitespace=True)
df.index = pd.to_datetime(df.index)
def f(x, **kwargs):
    # get the numerical index of the row
    numericIndex = kwargs["df"].index.get_loc(x.name)
    dict_inverted = kwargs["dict"]
    # skip the first line, returning NaN
    if numericIndex == 0:
        return np.NaN, np.NaN
    # if value_cum is the same as in the previous row (nothing changed),
    # we need some tweaking (compute using the date before) to return the same data
    ilocvalue = kwargs["df"].iloc[[numericIndex - 1]]["value_cum"][0]
    if x['value_cum'] == ilocvalue:
        name = dict_inverted[x['value_cum']]
    else:
        name = x.name
    # series to compare with the actual row
    series = kwargs["value_cum"]
    # cut this series, taking into account only the days before the actual date
    cutSeries = series[series.index < name]
    rowValueToCompare = float(x['value_cum'])
    # use query to filter the rows
    # https://stackoverflow.com/questions/40171498/is-there-a-query-method-or-similar-for-pandas-series-pandas-series-query
    result = cutSeries.to_frame().query(f'({rowValueToCompare} / value_cum) >= 2.0')
    # if empty, return NaN
    if result.empty:
        return np.NaN, np.NaN
    # get the last (most recent) result
    oneResult = result.tail(1).iloc[:, 0]
    # compute the values to return
    value = (rowValueToCompare / oneResult.values[0])
    idx = oneResult.index[0]
    # delta between the actual row's day and the >=2 day
    delta = name - idx
    # return the two columns
    return value, delta.days
df_cases = df >> arrange(X.index, ascending=True) \
              >> mutate(value_cum=cumsum(X.value))
df_map_value = df_cases.drop_duplicates(["value_cum"])
dict_value = df_map_value["value_cum"].to_dict()
dict_value_inverted = {v: k for k, v in dict_value.items()}
print(dict_value_inverted)

df_cases[["computeValue", "delta"]] = df_cases.apply(
    f, result_type="expand", axis=1,
    dict=dict_value_inverted, df=df_cases, value_cum=df_cases['value_cum'])
print(df_cases)
I'm not really happy with this code; passing the entire DataFrame to my apply method feels weird.
I'm sure there is better pandas code to do this in fewer lines and more elegantly, probably using nested apply methods, but I haven't found how.
The dictionary method for storing the date of the first duplicate is also weird; I don't know whether it's possible to do that with apply (reusing the result of a previous computation during the apply) or whether the only way is to write a recursive function.
QUESTION UPDATED WITH EDIT 1/2/3, WORKING WITH DUPLICATE VALUES
EDIT ARCHIVED
Edit 1:
data = """value
"2020-03-02" 1
"2020-03-03" 0
"2020-03-01" 1
"2020-03-04" 0
"2020-03-05" 4"""
I see that my code doesn't take into account the case where values are equal to zero.
value value_cum computeValue delta
2020-03-01 1 1 NaN NaN
2020-03-02 1 2 2.0 1.0
2020-03-03 0 2 2.0 2.0
2020-03-04 0 2 2.0 3.0
2020-03-05 4 6 3.0 1.0
For 2020-03-03, computeValue is equal to 3.0 and not 2.0, and delta is equal to 2.0 days and not 1.0 days (like 2020-03-02).
I cannot access previously computed values during the apply computation, so I searched for another way to do it.
Edit 2:
Found a way, by passing a pre-computed dictionary:
df_map_value = df_cases.drop_duplicates(["value_cum"])
dict_value = df_map_value["value_cum"].to_dict()
dict_value_inverted = {v: k for k, v in dict_value.items()}
print(dict_value_inverted)
Now, when I find a cum_value equal to an already-seen value, I return the index that was used for the original computation.
Upvotes: 3
Views: 298
Reputation: 2657
Some points
The example you gave is a bit simple, and I believe that makes it harder to think about the more generic case. So I generated random data for 30 days using numpy.
Looking at the link you sent, I think they're showing us "how many days ago is the latest day that the current day is double of".
To show this explicitly I will use very verbose column names in pandas, and before calculating the metrics you want, I will build into the dataframe a reference list called days_current_day_is_double_of which will, for each row (day), hold the list of days whose deaths_cum the current day's deaths_cum is double of.
This column can later be substituted with a simple np.where() operation every time you want to find this for a row, if you don't want to keep a reference list in the dataframe. I think it's clearer keeping it.
generating data
import pandas as pd
import numpy as np
import io
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
#n_of_days = 30
#random_data = np.random.randint(0,100,size=n_of_days)
#date_range = pd.date_range(start="2020-03-02",freq="D",periods=n_of_days)
#random_data = pd.DataFrame({"deaths":random_data})
#random_data.index = pd.to_datetime(date_range)
#df= random_data
import requests
import json
response = requests.get("https://api-covid.unthinkingdepths.fr/covid19/ecdc?type=cum")
data = json.loads(response.text)["data"]
deaths_cums = [x["deaths_cum"] for x in data]
dates = [x["dateRep"] for x in data]
df = pd.DataFrame({"deaths_cum":deaths_cums})
df.index = pd.to_datetime(dates)
A verbose solution in pandas
The key here is:
using apply() to iterate over rows
using np.where to do the backwards search explicitly
I use np.where inside the helper function check_condition(row) to create the day references once, and then use find_index(list_of_days, idx) to search them again any time.
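As a tiny standalone illustration of that backwards search (the numbers here are made up, not from the API):
import numpy as np

cum = np.array([3, 5, 9, 9, 11])   # made-up cumulative deaths
i = 4                              # index of the "current" day
hits = np.where(cum[i] / cum[:i] >= 2)[0]
print(hits)      # [0 1] -> day 4 has at least double the deaths of days 0 and 1
print(hits[-1])  # 1     -> the latest such day, which find_index(hits, -1) returns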
big picture of the code
# create helper functions
def check_condition(row):
    ...  # 7 lines (full version below)
def delta_fromlast_day_currDay_is_double_of(row):
    ...  # 12 lines (full version below)
def how_many_days_fromlast_day_currDay_is_double_of(row):
    ...  # 11 lines (full version below)
def find_index(list_of_days, index):
    ...  # 4 lines (full version below)

# use apply here with lambda functions
...  # 23 lines, starting with df['deaths_cum'] = np.cumsum(df['deaths'])
print(df)
Full solution code
def check_condition(row):
    row_idx = df.index.get_loc(row.name)
    currRow_deaths_cum = df.iloc[row_idx]['deaths_cum']
    rows_before_current_deaths_cum = df.iloc[:row_idx]['deaths_cum']
    # indices of all previous days whose deaths_cum the current day is at least double of
    currRow_is_more_thanDoubleOf = np.where((currRow_deaths_cum/rows_before_current_deaths_cum) >= 2)[0]
    return currRow_is_more_thanDoubleOf

def delta_fromlast_day_currDay_is_double_of(row):
    row_idx = df.index.get_loc(row.name)
    currRow_deaths_cum = df.iloc[row_idx]['deaths_cum']
    list_of_days = df.iloc[row_idx]['days_current_day_is_double_of']
    last_day_currDay_is_double_of = find_index(list_of_days, -1)
    if last_day_currDay_is_double_of is np.nan:
        delta = np.nan
    else:
        last_day_currDay_is_double_of_deaths_cum = df.iloc[last_day_currDay_is_double_of]["deaths_cum"]
        delta = currRow_deaths_cum - last_day_currDay_is_double_of_deaths_cum
    return delta

def how_many_days_fromlast_day_currDay_is_double_of(row):
    row_idx = df.index.get_loc(row.name)
    list_of_days = df.iloc[row_idx]['days_current_day_is_double_of']
    last_day_currDay_is_double_of = find_index(list_of_days, -1)
    if last_day_currDay_is_double_of is np.nan:
        delta = np.nan
    else:
        delta = row_idx - last_day_currDay_is_double_of
    return delta

def find_index(list_of_days, index):
    if list_of_days.any(): return list_of_days[index]
    else: return np.nan
# use apply here with lambda functions
#df['deaths_cum'] = np.cumsum(df['deaths'])
df['deaths_cum_ratio_from_day0'] = df['deaths_cum'].apply(
    lambda cum_deaths: cum_deaths/df['deaths_cum'].iloc[0]
    if df['deaths_cum'].iloc[0] != 0
    else np.nan
)
#df['increase_in_deaths_cum'] = df['deaths_cum'].diff().cumsum() # <- this might be interesting for you to use for other analyses
df['days_current_day_is_double_of'] = df.apply(
    lambda row: check_condition(row),
    axis=1
)
df['first_day_currDay_is_double_of'] = df['days_current_day_is_double_of'].apply(lambda list_of_days: find_index(list_of_days, 0))
df['last_day_currDay_is_double_of'] = df['days_current_day_is_double_of'].apply(lambda list_of_days: find_index(list_of_days, -1))
df['delta_fromfirst_day'] = df['deaths_cum'] - df['deaths_cum'].iloc[0]
df['delta_fromlast_day_currDay_is_double_of'] = df.apply(
    lambda row: delta_fromlast_day_currDay_is_double_of(row),
    axis=1
)
df['how_many_days_fromlast_day_currDay_is_double_of'] = df.apply(
    lambda row: how_many_days_fromlast_day_currDay_is_double_of(row),
    axis=1
)
print(df[-30:])
PANDAS SOLUTION OUTPUT
deaths_cum deaths_cum_ratio_from_day0 \
2020-03-22 562 NaN
2020-03-23 674 NaN
2020-03-24 860 NaN
2020-03-25 1100 NaN
2020-03-26 1331 NaN
2020-03-27 1696 NaN
2020-03-28 1995 NaN
2020-03-29 2314 NaN
2020-03-30 2606 NaN
2020-03-31 3024 NaN
2020-04-01 3523 NaN
2020-04-02 4032 NaN
2020-04-03 4503 NaN
2020-04-04 6507 NaN
2020-04-05 7560 NaN
2020-04-06 8078 NaN
2020-04-07 8911 NaN
2020-04-08 10328 NaN
2020-04-09 10869 NaN
2020-04-10 12210 NaN
2020-04-11 13197 NaN
2020-04-12 13832 NaN
2020-04-13 14393 NaN
2020-04-14 14967 NaN
2020-04-15 15729 NaN
2020-04-16 17167 NaN
2020-04-17 17920 NaN
2020-04-18 18681 NaN
2020-04-19 19323 NaN
2020-04-20 19718 NaN
days_current_day_is_double_of \
2020-03-22 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-03-23 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-03-24 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-03-25 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-03-26 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-03-27 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-03-28 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-03-29 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-03-30 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-03-31 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-01 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-02 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-03 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-04 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-05 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-06 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-07 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-08 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-09 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-10 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-11 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-12 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-13 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-14 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-15 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-16 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-17 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-18 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-19 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-20 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
first_day_currDay_is_double_of last_day_currDay_is_double_of \
2020-03-22 0.0 79.0
2020-03-23 0.0 79.0
2020-03-24 0.0 80.0
2020-03-25 0.0 81.0
2020-03-26 0.0 82.0
2020-03-27 0.0 83.0
2020-03-28 0.0 84.0
2020-03-29 0.0 85.0
2020-03-30 0.0 85.0
2020-03-31 0.0 86.0
2020-04-01 0.0 87.0
2020-04-02 0.0 88.0
2020-04-03 0.0 88.0
2020-04-04 0.0 91.0
2020-04-05 0.0 92.0
2020-04-06 0.0 93.0
2020-04-07 0.0 93.0
2020-04-08 0.0 94.0
2020-04-09 0.0 94.0
2020-04-10 0.0 94.0
2020-04-11 0.0 95.0
2020-04-12 0.0 95.0
2020-04-13 0.0 95.0
2020-04-14 0.0 95.0
2020-04-15 0.0 96.0
2020-04-16 0.0 97.0
2020-04-17 0.0 98.0
2020-04-18 0.0 98.0
2020-04-19 0.0 98.0
2020-04-20 0.0 98.0
delta_fromfirst_day delta_fromlast_day_currDay_is_double_of \
2020-03-22 562 318.0
2020-03-23 674 430.0
2020-03-24 860 488.0
2020-03-25 1100 650.0
2020-03-26 1331 769.0
2020-03-27 1696 1022.0
2020-03-28 1995 1135.0
2020-03-29 2314 1214.0
2020-03-30 2606 1506.0
2020-03-31 3024 1693.0
2020-04-01 3523 1827.0
2020-04-02 4032 2037.0
2020-04-03 4503 2508.0
2020-04-04 6507 3483.0
2020-04-05 7560 4037.0
2020-04-06 8078 4046.0
2020-04-07 8911 4879.0
2020-04-08 10328 5825.0
2020-04-09 10869 6366.0
2020-04-10 12210 7707.0
2020-04-11 13197 6690.0
2020-04-12 13832 7325.0
2020-04-13 14393 7886.0
2020-04-14 14967 8460.0
2020-04-15 15729 8169.0
2020-04-16 17167 9089.0
2020-04-17 17920 9009.0
2020-04-18 18681 9770.0
2020-04-19 19323 10412.0
2020-04-20 19718 10807.0
how_many_days_fromlast_day_currDay_is_double_of
2020-03-22 3.0
2020-03-23 4.0
2020-03-24 4.0
2020-03-25 4.0
2020-03-26 4.0
2020-03-27 4.0
2020-03-28 4.0
2020-03-29 4.0
2020-03-30 5.0
2020-03-31 5.0
2020-04-01 5.0
2020-04-02 5.0
2020-04-03 6.0
2020-04-04 4.0
2020-04-05 4.0
2020-04-06 4.0
2020-04-07 5.0
2020-04-08 5.0
2020-04-09 6.0
2020-04-10 7.0
2020-04-11 7.0
2020-04-12 8.0
2020-04-13 9.0
2020-04-14 10.0
2020-04-15 10.0
2020-04-16 10.0
2020-04-17 10.0
2020-04-18 11.0
2020-04-19 12.0
2020-04-20 13.0
If you check, how_many_days_fromlast_day_currDay_is_double_of matches exactly the XDelta from the API :)
There are many small suggestions in case you want to really generalize your code. I don't think that's what you're looking for, but I will list some:
Make the growth factor a parameter of check_growth_condition:
def check_growth_condition(row, growth_factor):
    ...
    np.where((currRow_deaths_cum/rows_before_current_deaths_cum) >= growth_factor)[0] # <- just replace 2 with the growth factor
    ...
Shrink days_current_day_is_double_of to just the latest date the current day is double of, because all days before the latest will also satisfy the ratio. I'll keep the first and last just for the sake of showing a "range of days":
def check_growth_condition(row, growth_factor):
    ...
    # doing the backwards search with np.where
    currRow_is_more_thanDoubleOf = np.where((currRow_deaths_cum/rows_before_current_deaths_cum) >= growth_factor)[0]
    if currRow_is_more_thanDoubleOf.any():
        return np.array([currRow_is_more_thanDoubleOf[0], currRow_is_more_thanDoubleOf[-1]]) # <- return just the first and last
    else:
        return currRow_is_more_thanDoubleOf # empty list
    ...
Note also that if you want to get rid of the reference column, you just need to use np.where((currRow_deaths_cum/rows_before_current_deaths_cum) >= growth_factor)[0] wherever I am using the check_growth_condition function; again, np.where is always doing the searching.
Generalize delta_from_any_day: instead of always subtracting, pass a function as input, such as np.divide to calculate ratios or np.subtract to calculate deltas as I do in the example:
def delta_from_any_day(row, day_idx, column_name='deaths_cum', func=np.subtract):
    row_idx = df.index.get_loc(row.name)
    currRow_deaths_cum = df.iloc[row_idx][column_name]
    if day_idx is np.nan:
        delta = np.nan
    else:
        day_idx_deaths_cum = df.iloc[day_idx][column_name]
        delta = func(currRow_deaths_cum, day_idx_deaths_cum)
    return delta
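As a hypothetical usage example (assuming df has already been built as above, and that day index 90 exists with a non-zero count), passing np.divide returns the growth ratio instead of the difference:
# hypothetical usage: growth ratio of the last row vs. day 90
last_row = df.iloc[-1]
ratio = delta_from_any_day(last_row, 90, column_name='deaths_cum', func=np.divide)
print(ratio)  # deaths_cum on the last day divided by deaths_cum on day 90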
Cleaner Pandas solution
Note that we're just reusing check_growth_condition, find_index (for the backwards search), and delta_from_any_day (to calculate the deltas). We reuse those three in all the other helper functions to calculate things.
def check_growth_condition(row, growth_factor):
    row_idx = df.index.get_loc(row.name)
    currRow_deaths_cum = df.iloc[row_idx]['deaths_cum']
    rows_before_current_deaths_cum = df.iloc[:row_idx]['deaths_cum']
    currRow_is_more_thanDoubleOf = np.where((currRow_deaths_cum/rows_before_current_deaths_cum) >= growth_factor)[0]
    if currRow_is_more_thanDoubleOf.any():
        return np.array([currRow_is_more_thanDoubleOf[0], currRow_is_more_thanDoubleOf[-1]])
    else:
        return currRow_is_more_thanDoubleOf # empty list

def find_index(list_of_days, index):
    if list_of_days.any(): return list_of_days[index]
    else: return np.nan

def delta_from_any_day(row, day_idx, column_name='deaths_cum', func=np.subtract):
    row_idx = df.index.get_loc(row.name)
    currRow_deaths_cum = df.iloc[row_idx][column_name]
    if day_idx is np.nan:
        delta = np.nan
    else:
        day_idx_deaths_cum = df.iloc[day_idx][column_name]
        delta = func(currRow_deaths_cum, day_idx_deaths_cum)
    return delta

def delta_fromlast_day_currDay_is_double_of(row):
    row_idx = df.index.get_loc(row.name)
    currRow_deaths_cum = df.iloc[row_idx]['deaths_cum']
    list_of_days = df.iloc[row_idx]['rangeOf_days_current_day_is_double_of']
    last_day_currDay_is_double_of = find_index(list_of_days, -1)
    delta = delta_from_any_day(row, last_day_currDay_is_double_of, column_name="deaths_cum")
    return delta

def how_many_days_fromlast_day_currDay_is_double_of(row):
    row_idx = df.index.get_loc(row.name)
    list_of_days = df.iloc[row_idx]['rangeOf_days_current_day_is_double_of']
    last_day_currDay_is_double_of = find_index(list_of_days, -1)
    delta = delta_from_any_day(row, last_day_currDay_is_double_of, column_name="day_index")
    return delta
# use apply here with lambda functions
#df['deaths_cum'] = np.cumsum(df['deaths'])
#df['deaths_cum_ratio_from_day0'] = df['deaths_cum'].apply(
#    lambda cum_deaths: cum_deaths/df['deaths_cum'].iloc[0]
#    if df['deaths_cum'].iloc[0] != 0
#    else np.nan
#)
#df['increase_in_deaths_cum'] = df['deaths_cum'].diff().cumsum() # <- this might be interesting for you to use for other analyses
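# note: the 'day_index' column read by how_many_days_fromlast_day_currDay_is_double_of
# is not created anywhere in this listing; presumably it is a plain day counter,
# e.g. (my assumption):
df['day_index'] = np.arange(len(df))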
df['rangeOf_days_current_day_is_double_of'] = df.apply(
    lambda row: check_growth_condition(row, 2),
    axis=1
)
df['first_day_currDay_is_double_of'] = df['rangeOf_days_current_day_is_double_of'].apply(lambda list_of_days: find_index(list_of_days, 0))
df['last_day_currDay_is_double_of'] = df['rangeOf_days_current_day_is_double_of'].apply(lambda list_of_days: find_index(list_of_days, -1))
df['delta_fromfirst_day'] = df['deaths_cum'] - df['deaths_cum'].iloc[0]
df['delta_fromlast_day_currDay_is_double_of'] = df.apply(
    lambda row: delta_fromlast_day_currDay_is_double_of(row),
    axis=1
)
df['how_many_days_fromlast_day_currDay_is_double_of'] = df.apply(
    lambda row: how_many_days_fromlast_day_currDay_is_double_of(row),
    axis=1
)
print(df[-5:])
print(df[-5:])
Clean Output
day_index deaths_cum rangeOf_days_current_day_is_double_of \
2020-04-16 107 17167 [0, 97]
2020-04-17 108 17920 [0, 98]
2020-04-18 109 18681 [0, 98]
2020-04-19 110 19323 [0, 98]
2020-04-20 111 19718 [0, 98]
first_day_currDay_is_double_of last_day_currDay_is_double_of \
2020-04-16 0.0 97.0
2020-04-17 0.0 98.0
2020-04-18 0.0 98.0
2020-04-19 0.0 98.0
2020-04-20 0.0 98.0
delta_fromfirst_day delta_fromlast_day_currDay_is_double_of \
2020-04-16 17167 9089.0
2020-04-17 17920 9009.0
2020-04-18 18681 9770.0
2020-04-19 19323 10412.0
2020-04-20 19718 10807.0
how_many_days_fromlast_day_currDay_is_double_of
2020-04-16 10.0
2020-04-17 10.0
2020-04-18 11.0
2020-04-19 12.0
2020-04-20 13.0
Upvotes: 6
Reputation: 2748
This sounds like a job for pd.merge_asof.
def track_growths(df, growth_factor=2):
    df = df.sort_index().reset_index()
    df['index'] = pd.to_datetime(df['index'])
    df['cum_value'] = df['value'].cumsum()

    merged = pd.merge_asof(df.assign(lookup=df['cum_value'] / growth_factor),
                           df.assign(lookup=df['cum_value'].astype(float)),
                           on='lookup',
                           suffixes=['', '_past'])

    result = merged[['index', 'value', 'cum_value']]
    growth = merged['cum_value'] / merged['cum_value_past']
    days_since = (merged['index'] - merged['index_past']).dt.days
    return result.assign(computeValue=growth, delta=days_since).set_index('index')
This has a configurable growth factor, in case you want to try something other than 2x.
track_growths(df)
# value cum_value computeValue delta
# index
# 2020-03-01 3 3 NaN NaN
# 2020-03-02 2 5 NaN NaN
# 2020-03-03 4 9 3.0 2.0
# 2020-03-04 0 9 3.0 3.0
# 2020-03-06 0 9 3.0 5.0
# 2020-03-07 2 11 2.2 5.0
# 2020-03-08 0 11 2.2 6.0
track_growths(df, 3)
# value cum_value computeValue delta
# index
# 2020-03-01 3 3 NaN NaN
# 2020-03-02 2 5 NaN NaN
# 2020-03-03 4 9 3.000000 2.0
# 2020-03-04 0 9 3.000000 3.0
# 2020-03-06 0 9 3.000000 5.0
# 2020-03-07 2 11 3.666667 6.0
# 2020-03-08 0 11 3.666667 7.0
track_growths(df, 1.5)
# value cum_value computeValue delta
# index
# 2020-03-01 3 3 NaN NaN
# 2020-03-02 2 5 1.666667 1.0
# 2020-03-03 4 9 1.800000 1.0
# 2020-03-04 0 9 1.800000 2.0
# 2020-03-06 0 9 1.800000 4.0
# 2020-03-07 2 11 2.200000 5.0
# 2020-03-08 0 11 2.200000 6.0
Detailed explanation
Starting from your original data:
df
# value
# 2020-03-01 3
# 2020-03-02 2
# 2020-03-03 4
# 2020-03-04 0
# 2020-03-06 0
# 2020-03-07 2
# 2020-03-08 0
Let's first make sure that the index is sorted, then convert it back to a normal column and parse it into a datetime. This is also a good time to add the cumulative value, which gets us through your existing prep:
df = df.sort_index().reset_index()
df['index'] = pd.to_datetime(df['index'])
df['cum_value'] = df['value'].cumsum()
df
# index value cum_value
# 0 2020-03-01 3 3
# 1 2020-03-02 2 5
# 2 2020-03-03 4 9
# 3 2020-03-04 0 9
# 4 2020-03-06 0 9
# 5 2020-03-07 2 11
# 6 2020-03-08 0 11
Now here comes the big trick, in which merge_asof allows you to look up the half-rate rows directly:
merged = pd.merge_asof(df.assign(lookup=df['cum_value'] / 2),
df.assign(lookup=df['cum_value'].astype(float)),
on='lookup',
suffixes=['', '_past'])
merged
# index value cum_value lookup index_past value_past cum_value_past
# 0 2020-03-01 3 3 1.5 NaT NaN NaN
# 1 2020-03-02 2 5 2.5 NaT NaN NaN
# 2 2020-03-03 4 9 4.5 2020-03-01 3.0 3.0
# 3 2020-03-04 0 9 4.5 2020-03-01 3.0 3.0
# 4 2020-03-06 0 9 4.5 2020-03-01 3.0 3.0
# 5 2020-03-07 2 11 5.5 2020-03-02 2.0 5.0
# 6 2020-03-08 0 11 5.5 2020-03-02 2.0 5.0
This will perform a "backward" search to try and find a match for every row in the first DataFrame. Per the docs:
A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.
Here the key is the lookup value, which is half of cum_value for the left (current) DataFrame, and equal to cum_value for the right (historical) DataFrame. If we update the docs to match this case, it would read something like this:
Select the last row in the historical DataFrame where cum_value is less than or equal to half the current cum_value.
This is exactly what you want: the most recent day in history with no more than half the case counts.
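To see that backward search in isolation, here is a toy example with made-up numbers (not part of the data above):
import pandas as pd

left = pd.DataFrame({'lookup': [4.5, 5.5]})        # half of each current cum_value
right = pd.DataFrame({'lookup': [3.0, 5.0, 9.0],   # historical cum_value as the key
                      'cum_value_past': [3, 5, 9]})
print(pd.merge_asof(left, right, on='lookup'))
#    lookup  cum_value_past
# 0     4.5               3   <- last historical value <= 4.5
# 1     5.5               5   <- last historical value <= 5.5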
From here it is quick work to compute the derived delta and computeValue information and format the result.
result = merged[['index', 'value', 'cum_value']]
growth = merged['cum_value'] / merged['cum_value_past']
days_since = (merged['index'] - merged['index_past']).dt.days
result.assign(computeValue=growth, delta=days_since).set_index('index')
# value cum_value computeValue delta
# index
# 2020-03-01 3 3 NaN NaN
# 2020-03-02 2 5 NaN NaN
# 2020-03-03 4 9 3.0 2.0
# 2020-03-04 0 9 3.0 3.0
# 2020-03-06 0 9 3.0 5.0
# 2020-03-07 2 11 2.2 5.0
# 2020-03-08 0 11 2.2 6.0
Upvotes: 1
Reputation: 5741
Initialise the data:
import io
import pandas as pd
data = """value
"2020-03-02" 2
"2020-03-03" 4
"2020-03-01" 3
"2020-03-04" 0
"2020-03-08" 0
"2020-03-06" 0
"2020-03-07" 2"""
df = pd.read_table(io.StringIO(data), delim_whitespace=True)
df.index = pd.to_datetime(df.index)
df = df.sort_index()
First add the cumulative total of df['value'] as a column:
df['value_cum'] = df['value'].cumsum()
If I understand you correctly, you are looking at the growth factor of this cumulative total since its inception (i.e. its first entry, .iloc[0]):
day_0 = df['value_cum'].iloc[0]
df['growth_factor_since_day_0'] = df['value_cum'] / day_0
Now all we need to do is check how many days it took to reach >= 2:
((df['growth_factor_since_day_0'] >= 2) == False).sum()
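On the example data above (value_cum = 3, 5, 9, 9, 9, 11, 11), I'd expect this check to give 2, assuming the cumsum from the previous step:
# growth factors since day 0: 1.0, 1.67, 3.0, 3.0, 3.0, 3.67, 3.67
print(((df['growth_factor_since_day_0'] >= 2) == False).sum())  # -> 2 days below 2x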
You could specify a threshold, like in the example you linked, to prevent getting an early hit (going from value 1 to 2, for example):
day_0 = df['value_cum'].loc[df['value_cum'] >= 5].min()
This will return NaN in the df['growth_factor_since_day_0'] column in case that threshold has not been reached yet, making sure we don't get false positives.
Upvotes: 0