Reputation: 1251
Background: We have a ticketing system and each ticket has fields such as open date, close date, category, type, etc. Each ticket is represented by one row in my data with a key to identify the ticket.
An individual record might look like:
Number, Type, Category, Opened, Closed
TICKET100, Database, Software, 2/1/2020 11:30 AM, 4/22/2020 4:40 PM
Goal: My goal is to create a function that takes an input dataframe (pandas), an age parameter (in days), and a list of attributes/dimensions. The function would then return a dataframe of how many tickets were open past that age, broken down by each of the attributes/dimensions and by date.
Example of input:
ticket_age(input_dataframe, age=5, dimensions=['Type','Category'])
Example snippet of desired output:
Date, Type, Category, Count
3/1/2020, Database, Software, 1
3/2/2020, Database, Software, 1
...
4/22/2020, Database, Software, 0
An important note about the output: if the intersection of the date and the dimensions doesn't have any tickets that satisfy the condition, the function should still create a row with a count of 0.
Where I'm stuck: I have no idea how to accept a list of dimensions of unknown size and iterate through all of them.
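From what I can tell, pandas groupby will accept a list of column names of any length, so the iteration over the dimensions themselves might look something like the rough, untested sketch below (the dataframe here is just a tiny made-up stand-in for my ticket data). What I haven't worked out is how to combine this with the per-date counting and the zero-count rows.
import pandas as pd

#Tiny made-up stand-in for my ticket data, just to show two combinations
df = pd.DataFrame({
    'Number': ['TICKET100', 'TICKET101'],
    'Type': ['Database', 'Network'],
    'Category': ['Software', 'Hardware'],
})

dimensions = ['Type', 'Category']   #could be 1, 2, or more columns
for combo, subset in df.groupby(dimensions):
    #combo is a tuple of dimension values, e.g. ('Database', 'Software')
    #subset is the slice of df with that combination of values
    print(combo, len(subset))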
What have I tried? I've been able to generate the ticket counts that meet the age criteria when I hard-code the loops through the dimensions.
How I calculate type_list, first_date, and totalDays:
import datetime

#Create function to find the minimum date
def date_minimum(input_dataframe, date_to_check):
    return input_dataframe[date_to_check].min().date()

#Create function to find the maximum date
def date_maximum(input_dataframe, date_to_check):
    return input_dataframe[date_to_check].max().date()

#Set up min and max dates
min_date = date_minimum(df_aged_input, 'Opened')
max_date = date_maximum(df_aged_input, 'Closed')

#Get the first relevant date for the dataframe loop
first_date = min_date + datetime.timedelta(days=aged_window)

#Generate a list of unique types
type_list = df_aged_input['Type'].unique().tolist()
My loop
aged_output_list = []
for type_iterate in range(len(type_list)):
    #filter by the type
    aged_type = type_list[type_iterate]
    df_aged_input = df_tkt_relevant[df_tkt_relevant['Type'] == aged_type].copy()
    for date_iterate in range(totalDays.days):
        #generate the aged date for this iteration
        aged_date = first_date + datetime.timedelta(days=date_iterate)
        #Count the number of records in the data frame that match the input conditions
        aged_frame = df_aged_input[(~(df_aged_input['Closed'].dt.date < aged_date))
                                   & (df_aged_input['Opened'].dt.date < (aged_date - datetime.timedelta(days=aged_window + 1)))].copy()
        aged_frame['aged_Date'] = aged_date - datetime.timedelta(days=1)
        aged_count = aged_frame.shape[0]
        #Write the iteration date and the aged count to a new data source
        aged_output_list.append([aged_date, aged_type or 'Error: Missing Type', aged_count])
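For completeness, here is a rough, untested sketch of how I imagine the loop above might generalize to a dimensions list like the two-column example earlier. It keeps the same aging condition as my hard-coded loop, assumes the reporting window runs from first_date through the latest close date, and reindexes the per-date counts against every combination of dimension values so the zero-count rows show up:
import datetime
import pandas as pd

def ticket_age(input_dataframe, age, dimensions):
    #Reporting window: from the first date a ticket could be aged
    #through the last close date
    first_date = input_dataframe['Opened'].min().date() + datetime.timedelta(days=age)
    last_date = input_dataframe['Closed'].max().date()
    #Every combination of dimension values, so missing combinations still get rows
    full_index = pd.MultiIndex.from_product(
        [input_dataframe[d].unique() for d in dimensions], names=dimensions)
    pieces = []
    aged_date = first_date
    while aged_date <= last_date:
        #Same aging condition as the hard-coded loop above
        aged = input_dataframe[
            (~(input_dataframe['Closed'].dt.date < aged_date))
            & (input_dataframe['Opened'].dt.date < (aged_date - datetime.timedelta(days=age + 1)))]
        #Count per combination of dimensions and fill absent combinations with 0
        counts = aged.groupby(dimensions).size().reindex(full_index, fill_value=0)
        piece = counts.rename('Count').reset_index()
        piece.insert(0, 'Date', aged_date)
        pieces.append(piece)
        aged_date += datetime.timedelta(days=1)
    return pd.concat(pieces, ignore_index=True)
I don't know whether building the full date/dimension grid this way is the idiomatic pandas approach, which is part of my question.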
What do I do next? Is there another library that does all of this for me?
Upvotes: 4
Views: 569
Reputation: 14113
IIUC (if I understand correctly):
import pandas as pd
from pandas.tseries.offsets import DateOffset
from io import StringIO
import numpy as np
# sample data
s = """Number,Type,Category,Opened,Closed
TICKET100,Database,Software,2/1/2020 11:30 AM,2/22/2020 4:40 PM
TICKET101,Database,Software,2/10/2020 11:30 AM,2/23/2020 4:40 PM
TICKET102,Database,Software,2/11/2020 11:30 AM,2/22/2020 4:40 PM
TICKET103,something,else,2/10/2020 11:30 AM,2/23/2020 4:40 PM
TICKET104,something,else,2/12/2020 11:30 AM,2/22/2020 4:40 PM"""
df = pd.read_csv(StringIO(s))
# convert to datetime
df['Opened'] = pd.to_datetime(df['Opened'])
df['Closed'] = pd.to_datetime(df['Closed'])
# create a function
def myFunc(df, age, dimensions):
    # zip the date offsets for open and close together
    z = zip(df['Opened'].dt.date + DateOffset(age), df['Closed'].dt.date - DateOffset(age))
    # create a date range for each record in the dataframe
    df['date_range'] = [pd.date_range(o, c, freq='D') for o, c in z]
    # group by your dimensions columns
    g = df.groupby(dimensions)
    # list comprehension over each group to create its full date range
    idxs = [pd.date_range(df['Opened'].min(), df['Closed'].max(), freq='D',
                          normalize=True) for _, df in g]
    # sum all the True values where the datetime index created above
    # is in the date_range column
    l = [[sum(df[1]['date_range'].apply(lambda x: idx.isin(x))),
          idx, [df[0]] * len(idx)] for idx, df in zip(idxs, g)]
    # transpose and hstack the list
    arr = np.transpose(np.hstack(l))
    # return pandas dataframe
    return pd.DataFrame(arr, columns=['count', 'date', 'dimensions'])
out = myFunc(df, 5, ['Type', 'Category'])
count date dimensions
0 0 2020-02-01 (Database, Software)
1 0 2020-02-02 (Database, Software)
2 0 2020-02-03 (Database, Software)
3 0 2020-02-04 (Database, Software)
4 0 2020-02-05 (Database, Software)
5 1 2020-02-06 (Database, Software)
6 1 2020-02-07 (Database, Software)
7 1 2020-02-08 (Database, Software)
8 1 2020-02-09 (Database, Software)
9 1 2020-02-10 (Database, Software)
10 1 2020-02-11 (Database, Software)
11 1 2020-02-12 (Database, Software)
12 1 2020-02-13 (Database, Software)
13 1 2020-02-14 (Database, Software)
14 2 2020-02-15 (Database, Software)
15 3 2020-02-16 (Database, Software)
16 3 2020-02-17 (Database, Software)
17 1 2020-02-18 (Database, Software)
18 0 2020-02-19 (Database, Software)
19 0 2020-02-20 (Database, Software)
20 0 2020-02-21 (Database, Software)
21 0 2020-02-22 (Database, Software)
22 0 2020-02-23 (Database, Software)
23 0 2020-02-10 (something, else)
24 0 2020-02-11 (something, else)
25 0 2020-02-12 (something, else)
26 0 2020-02-13 (something, else)
27 0 2020-02-14 (something, else)
28 1 2020-02-15 (something, else)
29 1 2020-02-16 (something, else)
30 2 2020-02-17 (something, else)
31 1 2020-02-18 (something, else)
32 0 2020-02-19 (something, else)
33 0 2020-02-20 (something, else)
34 0 2020-02-21 (something, else)
35 0 2020-02-22 (something, else)
36 0 2020-02-23 (something, else)
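If you need the exact Date, Type, Category, Count layout from the question, one way (building on out from above) is to split the tuple column back into one column per dimension:
# split the dimension tuples into one column per dimension;
# the column names here should match the dimensions passed to myFunc
dims = pd.DataFrame(out['dimensions'].tolist(), columns=['Type', 'Category'])
result = pd.concat([out[['date', 'count']], dims], axis=1)
result = result.rename(columns={'date': 'Date', 'count': 'Count'})
result = result[['Date', 'Type', 'Category', 'Count']]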
Upvotes: 3