Reputation: 1119
I've got a problem that I can't solve in Python (I've previously done this in SQL and I'm not quite as good at Python as I am at SQL)
This is an example of my data:
desc date_1 date_2 date_3 values
54287171 cc-cc 2018-03-14 2017-07-03 2018-05-21 55
49410141 other-dd 2012-01-18 2017-01-26 2011-12-30 17
37694577 other-dd 2018-07-05 2017-07-25 2018-06-19 9
54051782 other-cc 2014-10-23 2017-11-24 2014-10-31 37
7378464 dd-cc 2016-08-05 2018-05-15 2016-07-22 92
29665541 dd-cc 2011-12-14 2017-08-01 2012-05-01 40
2999878 dd-cc 2018-10-03 2018-04-13 2018-09-17 37
39453869 cc-cc 2015-11-24 2017-09-09 2015-11-21 81
7181109 dd-dd 2018-01-18 2017-11-24 2018-01-15 27
29580865 dd-cc 2017-04-24 2017-09-07 2017-05-04 38
14778957 other-cc 2017-11-02 2017-06-20 2018-06-26 49
32500886 cc-dd 2017-01-12 2017-05-26 2017-01-12 50
52146154 other-cc 2018-08-01 2017-03-27 2018-07-16 5
7208584 cc-dd 2018-03-13 2018-07-04 2018-04-26 8
35894666 cc-cc 2017-12-04 2018-06-13 2018-08-14 88
27565108 other-other 2015-10-19 2017-03-14 2016-01-22 88
50705834 other-cc 2018-01-08 2017-12-09 2018-01-11 62
45420360 dd-cc 2017-10-23 2017-09-02 2018-01-29 52
55933497 dd-cc 2017-04-14 2018-06-07 2017-09-27 36
46160680 dd-cc 2014-06-05 2018-01-16 2016-01-27 87
In brief, I'm trying to recreate is this functionality:
SUM(CASE
WHEN date_1 <= date_2 - interval '11' month
AND date_3 > date_2 - interval '11' month
THEN values
end)
But then also group by the desc column.
So I'm trying to develop a logic that will create an interval between dates and sum all of the values within that interval. To give further context, I'm trying to achieve two things:
An example output should essentially look like this:
desc interval values_sum
cc-cc at_date 55
cc-dd at_date 17
other-dd at_date 9
cc-cc date_minus_1 37
cc-dd date_minus_1 92
other-dd date_minus_1 40
cc-cc date_minus_2 37
Upvotes: 0
Views: 709
Reputation: 1119
Ok, so using the answer below and the assistance from someone at work I've tried a few different options, this is the most succinct solution we've come up with.
from dateutil.relativedelta import relativedelta
for i in np.arange(-12,12,1):
df['Month_' + str(i)] = df.apply(lambda x: x['values']
if (x['date_2'] <= x['date_1'] + relativedelta(months=i)) \
& (x['date_3'] > x['date_2'] + relativedelta(months=i))
else 0, axis=1)
Then the final piece was just a simple groupby on two fields, aggregating by sum. i.e. gf.groupby(['field_1','field_2']).sum()
Upvotes: 0
Reputation: 2158
I hope I understand your question correctly.
groupby()
groups by properties for one or more columns. You can group by "date_2" and/or by "desc" and/or by any other columns you like.numpy.timedelta64(11, 'M')
. A potential problem is that the timedelta function resolves this to a generic time distance and does not retain the denomination in months. This may be problematic because different months are not equally long. If you only care about months, consider storing only the months from some reference time.
A script to provide an example:
""" Create an example dataset """
import numpy as np
import pandas as pd
df = pd.DataFrame(columns=["desc", "date_1", "date_2", "date_3", "values"])
df.loc["54287171"] = ["cc-cc", pd.Timestamp("2018-03-14"), pd.Timestamp("2017-07-03"), pd.Timestamp("2018-05-21"), 55]
df.loc["49410141"] = ["other-dd", pd.Timestamp("2012-01-18"), pd.Timestamp("2017-01-26"), pd.Timestamp("2011-12-30"), 17]
df.loc["37694577"] = ["other-dd", pd.Timestamp("2018-07-05"), pd.Timestamp("2017-07-25"), pd.Timestamp("2018-06-19"), 9]
df.loc["54051782"] = ["other-cc", pd.Timestamp("2014-10-23"), pd.Timestamp("2017-11-24"), pd.Timestamp("2014-10-31"), 37]
df.loc["7378464"] = ["dd-cc", pd.Timestamp("2016-08-05"), pd.Timestamp("2018-05-15"), pd.Timestamp("2016-07-22"), 92]
df.loc["29665541"] = ["dd-cc", pd.Timestamp("2011-12-14"), pd.Timestamp("2017-08-01"), pd.Timestamp("2012-05-01"), 40]
df.loc["2999878"] = ["dd-cc", pd.Timestamp("2018-10-03"), pd.Timestamp("2018-04-13"), pd.Timestamp("2018-09-17"), 37]
df.loc["39453869"] = ["cc-cc", pd.Timestamp("2015-11-24"), pd.Timestamp("2017-09-09"), pd.Timestamp("2015-11-21"), 81]
df.loc["7181109"] = ["dd-dd", pd.Timestamp("2018-01-18"), pd.Timestamp("2017-11-24"), pd.Timestamp("2018-01-15"), 27]
df.loc["29580865"] = ["dd-cc", pd.Timestamp("2017-04-24"), pd.Timestamp("2017-09-07"), pd.Timestamp("2017-05-04"), 38]
df.loc["14778957"] = ["other-cc", pd.Timestamp("2017-11-02"), pd.Timestamp("2017-06-20"), pd.Timestamp("2018-06-26"), 49]
df.loc["32500886"] = ["cc-dd", pd.Timestamp("2017-01-12"), pd.Timestamp("2017-05-26"), pd.Timestamp("2017-01-12"), 50]
df.loc["52146154"] = ["other-cc", pd.Timestamp("2018-08-01"), pd.Timestamp("2017-03-27"), pd.Timestamp("2018-07-16"), 5]
df.loc["7208584"] = ["cc-dd", pd.Timestamp("2018-03-13"), pd.Timestamp("2018-07-04"), pd.Timestamp("2018-04-26"), 8]
df.loc["35894666"] = ["cc-cc", pd.Timestamp("2017-12-04"), pd.Timestamp("2018-06-13"), pd.Timestamp("2018-08-14"), 88]
df.loc["50705834"] = ["other-cc", pd.Timestamp("2018-01-08"), pd.Timestamp("2017-12-09"), pd.Timestamp("2018-01-11"), 62]
df.loc["45420360"] = ["dd-cc", pd.Timestamp("2017-10-23"), pd.Timestamp("2017-09-02"), pd.Timestamp("2018-01-29"), 52]
df.loc["55933497"] = ["dd-cc", pd.Timestamp("2017-04-14"), pd.Timestamp("2018-06-07"), pd.Timestamp("2017-09-27"), 36]
df.loc["46160680"] = ["dd-cc", pd.Timestamp("2014-06-05"), pd.Timestamp("2018-01-16"), pd.Timestamp("2016-01-27"), 87]
"""Question 1: Yes, groupby() groups by properties for one or more columns"""
df.groupby(["desc"]).sum()
# values
#desc
#cc-cc 224
#cc-dd 58
#dd-cc 382
#dd-dd 27
#other-cc 153
#other-dd 26
"""Question 2: You can define conditions, save them in the dataframe, then group by those too."""
df["condition_1"] = df["date_2"] >= df["date_1"] + pd.Timedelta(np.timedelta64(11, 'M'))
df["condition_2"] = df["date_3"] >= df["date_2"] + pd.Timedelta(np.timedelta64(11, 'M'))
df.groupby(["desc", "condition_1", "condition_2"]).sum()
#
#desc condition_1 condition_2
#cc-cc False False 143
# True False 81
#cc-dd False False 58
#dd-cc False False 127
# True False 255
#dd-dd False False 27
#other-cc False False 62
# True 54
# True False 37
#other-dd False False 9
# True False 17
Upvotes: 1