Reputation: 761
I'm getting a pandas issue that I need help with.
On the one hand, I have a DataFrame that looks like the following:
contributor_id timestamp edits upper_month lower_month
0 8 2018-01-01 1 2018-04-01 2018-02-01
1 26424341 2018-01-01 11 2018-04-01 2018-02-01
10 26870381 2018-01-01 465 2018-04-01 2018-02-01
22 28109145 2018-03-01 17 2018-06-01 2018-04-01
23 32769624 2018-01-01 84 2018-04-01 2018-02-01
25 32794352 2018-01-01 4 2018-04-01 2018-02-01
On the other hand, I have (available in another DF), a given index of dates:
2018-01-01, 2018-02-01, 2018-03-01, 2018-04-01, 2018-05-01, 2018-06-01, 2018-07-01, 2018-08-01, 2018-09-01, 2018-10-01, 2018-11-01, 2018-12-01.
I need to create a pd.Series that has as an index the previously shown index. The data of the pd.Series must be, for each date in the index:
if date >= lower_month and date <= upper_month, then I add a 1.
The objective is to count, per each date, the number of times that the date is between the upper and lower month values in the previous DataFrame.
The sample output pd.Series for this case would be:
2018-01-01 0
2018-02-01 5
2018-03-01 5
2018-04-01 6
2018-05-01 1
2018-06-01 1
2018-07-01 0
2018-08-01 0
2018-09-01 0
2018-10-01 0
2018-11-01 0
2018-12-01 0
Is there a fast way of doing this calculation, avoiding to traverse the first dataframe a huge amount of times?
Thank you all.
Upvotes: 4
Views: 196
Reputation: 863256
Use list comprehension with flattening for test membership between zipped columns converted to tuples and values in range, create DataFrame
and sum
in generator:
rng = pd.date_range('2018-01-01', freq='MS', periods=12)
vals = list(zip(df['lower_month'], df['upper_month']))
s = pd.Series({y: sum(y >= x1 and y <= x2 for x1, x2 in vals) for y in rng})
EDIT:
For better performance use count
method, thank you @Stef:
s = pd.Series({y: [y >= x1 and y <= x2 for x1, x2 in vals].count(True) for y in rng})
print (s)
2018-01-01 0
2018-02-01 5
2018-03-01 5
2018-04-01 6
2018-05-01 1
2018-06-01 1
2018-07-01 0
2018-08-01 0
2018-09-01 0
2018-10-01 0
2018-11-01 0
2018-12-01 0
dtype: int64
Performace:
np.random.seed(123)
def random_dates(start, end, n=10000):
start_u = start.value//10**9
end_u = end.value//10**9
return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s').floor('d')
d1 = random_dates(pd.to_datetime('2015-01-01'), pd.to_datetime('2018-01-01')) + pd.offsets.MonthBegin(0)
d2 = random_dates(pd.to_datetime('2018-01-01'), pd.to_datetime('2020-01-01')) + pd.offsets.MonthBegin(0)
df = pd.DataFrame({'lower_month':d1, 'upper_month':d2})
rng = pd.date_range('2015-01-01', freq='MS', periods=6 * 12)
vals = list(zip(df['lower_month'], df['upper_month']))
In [238]: %timeit pd.Series({y: [y >= x1 and y <= x2 for x1, x2 in vals].count(True) for y in rng})
158 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [239]: %timeit pd.Series({y: sum(y >= x1 and y <= x2 for x1, x2 in vals) for y in rng})
221 ms ± 17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#first solution is slow
In [240]: %timeit pd.DataFrame([(y, y >= x1 and y <= x2) for x1, x2 in vals for y in rng], columns=['d','test']).groupby('d')['test'].sum().astype(int)
4.52 s ± 396 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 3
Reputation: 3739
I have used itertools to repeat the upper_month and lower month for each index_date
Then compare the index_date for each lower and upper_month and set temporary column
check = 1
Then sum check after group by index_date
import pandas as pd
from pandas.compat import StringIO, BytesIO
import itertools
#sample data
data = ('contributor_id,timestamp,edits,upper_month,lower_month\n'
'8,2018-01-01,1,2018-04-01,2018-02-01\n'
'26424341,2018-01-01,11,2018-04-01,2018-02-01\n'
'26870381,2018-02-01,465,2018-04-01,2018-02-01\n'
'28109145,2018-03-01,17,2018-06-01,2018-04-01\n')
orig_df = pd.read_csv(StringIO(data))
# sample index_dates
index_df = list(pd.Series(["2018-01-01", "2018-02-01"]))
# repeat upper_month and lower_month using itertools.product
abc = list(orig_df[['upper_month','lower_month']].values)
combine_list = [index_df,abc]
res = list(itertools.product(*combine_list))
df = pd.DataFrame(res,columns=["timestamp","range"])
#separate lower_month and upper_month from range
df['lower_month'] = df['range'].apply(lambda x : x[1])
df['upper_month'] = df['range'].apply(lambda x : x[0])
df.drop(['range'],axis=1,inplace=True)
# convert all dates column to make them consistent
orig_df['timestamp'] = pd.to_datetime(orig_df['timestamp']).dt.date.astype(str)
orig_df['upper_month'] = pd.to_datetime(orig_df['upper_month']).dt.date.astype(str)
orig_df['lower_month'] = pd.to_datetime(orig_df['lower_month']).dt.date.astype(str)
#apply condition to set check 1
df.loc[(df["timestamp"]>=df['lower_month']) & (df["timestamp"]<=df['upper_month']),"check"] = 1
#simply groupby to count check
res = df.groupby(['timestamp'])['check'].sum()
print(res)
timestamp
2018-01-01 0.0
2018-02-01 3.0
Upvotes: 0