Reputation: 76
Objective: I need to show the trend in ageing of issues. e.g. for each date in 2021 show the average age of the issues that were open as at that date.
Starting data (Historic issue list):. "df"
ref | Created | resolved |
---|---|---|
a1 | 20/4/2021 | 15/5/2021 |
a2 | 21/4/2021 | 15/6/2021 |
a3 | 23/4/2021 | 15/7/2021 |
Endpoint: "df2"
Date | Avg_age |
---|---|
1/1/2021 | x |
2/1/2021 | y |
3/1/2021 | z |
where x,y,z are the calculated averages of age for all issues open on the Date.
Tried so far: I got this to work in what feels like a very poor way.
hist_dates = pd.date_range(start="2021-01-01",end="2021-12-31"),freq="D")
result_list = []
for each_date in hist_dates:
f1=df.Created < each_date #filter1
f2=df.Resolved >= each_date #filter2
df['Refdate'] = each_date #make column to allow refdate-created
df['Age']= (df.Refdate - df.Created)
results_list.append(df[f1 & f2]).Age.mean())
Problems: This works, but it feels sloppy and it doesn't seem fast. The current data-set is small, but I suspect this wouldn't scale well. I'm trying not to solve everything with loops as I understand it is a common mistake for beginners like me.
Upvotes: 1
Views: 278
Reputation: 2720
I'll give you two solutions: the first one is step-by-step for you to understand the idea and process, the second one replicates the functionality in a much more condensed way, skipping some intermediate steps
First, create a new column that holds your issue age, i.e. df['age'] = df.resolved - df.Created
(I'm assuming your columns are of datetime
type, if not, use pd.to_datetime
to convert them)
You can then use groupby
to group your data by creation date. This will internally slice your dataframe into several pieces, one for each distinct value of Created
, grouping all values with the same creation date together. This way, you can then use aggregation on a creation date level to get the average issue age like so
# [['Created', 'age']] selects only the columns you are interested in
df[['Created', 'age']].groupby('Created').mean()
With an additional fourth data point [a4, 2021/4/20, 2021/4/30]
(to enable some proper aggregation), this would end up giving you the following Series with the average issue age by creation date:
age
Created
2021-04-20 17 days 12:00:00
2021-04-21 55 days 00:00:00
2021-04-23 83 days 00:00:00
A more condensed way of doing this is by defining a custom function and apply
it to each creation date grouping
def issue_age(s: pd.Series):
return (s['resolved'] - s['Created']).mean()
df.groupby('Created').apply(issue_age)
This call will give you the same Series
as before.
Upvotes: 1