Reputation: 3815
I would like to build a 2D table based on values (and countifs) from another table. I managed to prototype this successfully using Excel, however I am stuck with two concepts:
1. Emulating Excel COUNTIF() on pandas
2. Dynamically build a new dataframe
Note: COUNTIF() takes a range and an criterion as an argument. For example if I have a list of colors and I would like to know the number of times 'orange' is in the list below:
A
Red
Orange
Blue
Orange
Black
, then I would simply use the following formula:
COUNTIF(A1:A5, "Orange")
This should return 2.
Of course COUNTIF() functions can become more complex, such as form example concatenating criteria in this form COUNTIF(range1, criterion1, range2, criterion2...) can be interpreted as an AND criterian. Example if I wantto count of females over 35 in a list similar to the below:
A B
Female 19
Female 40
Male 45
, then I would simply use the following formula:
COUNTIF(A1:A3, "Female", B1:B3, ">35"
This should return 1.
Back to my use case. This is the source table:
Product No Opening Date Closing Date Opening Month Closing Month
0 1 2016-01-01 2016-06-30 2016-01-31 2016-06-30
1 2 2016-01-01 2016-04-30 2016-01-31 2016-04-30
2 3 2016-02-01 2016-06-30 2016-02-29 2016-06-30
3 4 2016-02-01 2016-05-31 2016-02-29 2016-05-31
4 5 2016-02-01 2099-12-31 2016-02-29 2099-12-31
5 6 2016-01-01 2099-12-31 2016-01-31 2016-10-31
6 7 2016-06-01 2016-07-31 2016-06-30 2016-07-31
7 8 2016-06-01 2016-11-30 2016-06-30 2016-11-30
8 9 2016-06-01 2016-07-31 2016-06-30 2016-07-31
9 10 2016-06-01 2099-12-31 2016-06-30 2099-12-31
And this is the 2d matrix that I want to achieve:
2016-01-31 2016-02-29 2016-03-31 2016-04-30 2016-05-31 \
2016-01-31 3 3 3 2 2
2016-02-29 3 3 3 3 2
2016-03-31 0 0 0 0 0
2016-04-30 0 0 0 0 0
2016-05-31 0 0 0 0 0
2016-06-30 4 4 4 4 4
2016-07-31 0 0 0 0 0
2016-08-31 0 0 0 0 0
2016-09-30 0 0 0 0 0
2016-10-31 0 0 0 0 0
2016-11-30 0 0 0 0 0
2016-12-31 0 0 0 0 0
2016-06-30 2016-07-31 2016-08-31 2016-09-30 2016-10-31 \
2016-01-31 1 1 1 1 0
2016-02-29 1 1 1 1 1
2016-03-31 0 0 0 0 0
2016-04-30 0 0 0 0 0
2016-05-31 0 0 0 0 0
2016-06-30 4 2 2 2 2
2016-07-31 0 0 0 0 0
2016-08-31 0 0 0 0 0
2016-09-30 0 0 0 0 0
2016-10-31 0 0 0 0 0
2016-11-30 0 0 0 0 0
2016-12-31 0 0 0 0 0
2016-11-30 2016-12-31
2016-01-31 0 0
2016-02-29 1 1
2016-03-31 0 0
2016-04-30 0 0
2016-05-31 0 0
2016-06-30 1 1
2016-07-31 0 0
2016-08-31 0 0
2016-09-30 0 0
2016-10-31 0 0
2016-11-30 0 0
2016-12-31 0 0
Basically I want to build a matrix of product survival through time. The vertical axis holds the origination of new products whereas the horizontal axis measures how much of these accounts persist through time.
For example if 10 products were launched in January, the figure for January vs January should be 10. If 1 of these 10 products was closed in February, the figure for January vs February should be 9. If all the remaining products were closed by June, then the rows January vs June, July, August, etc should be 0.
Product development in February, March, April, etc.. will not affect the January row.
I managed to build the 2d matrix using the following excel formula:
=COUNTIF(Accounts!$D$2:$D$11,Main!$A2)-COUNTIFS(Accounts!$D$2:$D$11,Main!$A2, Accounts!$E$2:$E$11,"<="&Main!B$1)
(this will populate the first cell)
My initial strategy was to build a multi-dimensional list and using a number of for-loops to populate them, but I am not sure whether there's an easier (or more recommended way) in Pandas.
Upvotes: 0
Views: 377
Reputation: 3305
Since I don't have enough reputation to comment on your question just yet, I'm going to assume that you have typos in your data where the year is equal to 2099.
I would also like to ask how in your 2016-06-30 row there are 4 'Product No' that somehow existed in the first few columns (i.e. 2016-01-31 to 2016-05-31).
If those are errors then here is my solution:
First, make the data:
# Make dataframe
df = pd.DataFrame({'Product No' : [i for i in range(1,11)],
'Opening Date' : ['2016-01-01']*2 +\
['2016-02-01']*3 +\
['2016-01-01'] +\
['2016-06-01']*4,
'Closing Date' : ['2016-06-30', '2016-04-30', '2016-06-30', '2016-05-31'] +\
['2016-12-31']*2 +\
['2016-07-31', '2016-11-30', '2016-07-31', '2016-12-31'],
'Opening Month' : ['2016-01-31']*2 +\
['2016-02-29']*3 +\
['2016-01-31'] +\
['2016-06-30']*4,
'Closing Month' : ['2016-06-30', '2016-04-30', '2016-06-30', '2016-05-31',
'2016-12-31', '2016-10-31', '2016-07-31', '2016-11-30',
'2016-07-31', '2016-12-31']})
# Reorder columns
df = df.loc[:, ['Product No', 'Opening Date', 'Closing Date',
'Opening Month', 'Closing Month']]
# Convert dates to datetime
for i in df.columns[1:]:
df.loc[:, i] = pd.to_datetime(df.loc[:, i])
Second, I created a 'daterange' dataframe for holding the min to max dates of the original data set. I also included a 'Product No' column so that each Product would have a row on the table:
# Create date range dataframe
daterange = pd.DataFrame({'daterange' : pd.date_range(start = df.loc[:, 'Opening Month'].min(),
end = df.loc[:, 'Closing Month'].max(),
freq = 'M'),
'Product No' : [1]*12})
# Create 10 multiples of the daterange and concatenate
daterange10 = pd.concat([daterange]*10)
# Find the cumulative sum of the 'Product No' for daterange10
daterange10.loc[:, 'Product No'] = daterange10.groupby('daterange').cumsum()
Third, I merge the daterange and original df together and limit rows to only include when a 'Product No' existed. Also note that I have it so the closed dates must be greater than or equal to the daterange since (in my opinion) if the product closed on the last day of the month, then it existed during that whole month:
# Merge df with daterange10
df = df.merge(daterange10,
how = 'inner',
on = 'Product No')
# Limit rows to when 'Opening Month' is <= 'daterange' and 'Closing Month' is >= 'daterange'
df = df[(df.loc[:, 'Opening Month'] <= df.loc[:, 'daterange']) &
(df.loc[:, 'Closing Month'] >= df.loc[:, 'daterange'])]
Last, I make a pivot table with the date values. Note that it only includes dates on the vertical axis that existed in the first place:
# Pivot on 'Opening Month', 'daterange'; count unique 'Product No'; fill NA with 0
df.pivot_table(index = 'Opening Month',
columns = 'daterange',
values = 'Product No',
aggfunc = pd.Series.nunique).fillna(0)
Upvotes: 1
Reputation: 671
Try putting your data into a pandas DataFrame, then using an iterative approach to build the product survival DataFrame:
import pandas as pd
mydata = pd.read_excel('mysourcedata.xlsx')
def product_survival(sourcedf, startdate, enddate):
df = pd.DataFrame()
daterange = pd.date_range(startdate, enddate, freq='M')
for i in daterange: # Rows
for j in daterange: # Columns
mycount = sourcedf[(sourcedf['Opening Month'] == i) & (sourcedf['Closing Month'] > j)]['Product No'].count()
df.loc[i, j] = mycount
return df
print(product_survival(mydata, '2016-01-31', '2016-12-31'))
Upvotes: 0