James
James

Reputation: 3815

Using Pandas to build a 2D table based on COUNTIF()'s of a separate excel sheet

I would like to build a 2D table based on values (and countifs) from another table. I managed to prototype this successfully using Excel, however I am stuck with two concepts:

1. Emulating Excel COUNTIF() on pandas
2. Dynamically build a new dataframe

Note: COUNTIF() takes a range and an criterion as an argument. For example if I have a list of colors and I would like to know the number of times 'orange' is in the list below:

A
Red
Orange
Blue
Orange
Black

, then I would simply use the following formula:

COUNTIF(A1:A5, "Orange")

This should return 2.

Of course COUNTIF() functions can become more complex, such as form example concatenating criteria in this form COUNTIF(range1, criterion1, range2, criterion2...) can be interpreted as an AND criterian. Example if I wantto count of females over 35 in a list similar to the below:

A             B
Female        19
Female        40
Male          45

, then I would simply use the following formula:

COUNTIF(A1:A3, "Female", B1:B3, ">35"

This should return 1.

Back to my use case. This is the source table:

   Product No Opening Date Closing Date Opening Month Closing Month
0           1   2016-01-01   2016-06-30    2016-01-31    2016-06-30
1           2   2016-01-01   2016-04-30    2016-01-31    2016-04-30
2           3   2016-02-01   2016-06-30    2016-02-29    2016-06-30
3           4   2016-02-01   2016-05-31    2016-02-29    2016-05-31
4           5   2016-02-01   2099-12-31    2016-02-29    2099-12-31
5           6   2016-01-01   2099-12-31    2016-01-31    2016-10-31
6           7   2016-06-01   2016-07-31    2016-06-30    2016-07-31
7           8   2016-06-01   2016-11-30    2016-06-30    2016-11-30
8           9   2016-06-01   2016-07-31    2016-06-30    2016-07-31
9          10   2016-06-01   2099-12-31    2016-06-30    2099-12-31

And this is the 2d matrix that I want to achieve:

            2016-01-31  2016-02-29  2016-03-31  2016-04-30  2016-05-31  \
2016-01-31           3           3           3           2           2
2016-02-29           3           3           3           3           2
2016-03-31           0           0           0           0           0
2016-04-30           0           0           0           0           0
2016-05-31           0           0           0           0           0
2016-06-30           4           4           4           4           4
2016-07-31           0           0           0           0           0
2016-08-31           0           0           0           0           0
2016-09-30           0           0           0           0           0
2016-10-31           0           0           0           0           0
2016-11-30           0           0           0           0           0
2016-12-31           0           0           0           0           0

            2016-06-30  2016-07-31  2016-08-31  2016-09-30  2016-10-31  \
2016-01-31           1           1           1           1           0
2016-02-29           1           1           1           1           1
2016-03-31           0           0           0           0           0
2016-04-30           0           0           0           0           0
2016-05-31           0           0           0           0           0
2016-06-30           4           2           2           2           2
2016-07-31           0           0           0           0           0
2016-08-31           0           0           0           0           0
2016-09-30           0           0           0           0           0
2016-10-31           0           0           0           0           0
2016-11-30           0           0           0           0           0
2016-12-31           0           0           0           0           0

            2016-11-30  2016-12-31
2016-01-31           0           0
2016-02-29           1           1
2016-03-31           0           0
2016-04-30           0           0
2016-05-31           0           0
2016-06-30           1           1
2016-07-31           0           0
2016-08-31           0           0
2016-09-30           0           0
2016-10-31           0           0
2016-11-30           0           0
2016-12-31           0           0

Basically I want to build a matrix of product survival through time. The vertical axis holds the origination of new products whereas the horizontal axis measures how much of these accounts persist through time.

For example if 10 products were launched in January, the figure for January vs January should be 10. If 1 of these 10 products was closed in February, the figure for January vs February should be 9. If all the remaining products were closed by June, then the rows January vs June, July, August, etc should be 0.

Product development in February, March, April, etc.. will not affect the January row.

I managed to build the 2d matrix using the following excel formula:

=COUNTIF(Accounts!$D$2:$D$11,Main!$A2)-COUNTIFS(Accounts!$D$2:$D$11,Main!$A2, Accounts!$E$2:$E$11,"<="&Main!B$1)

(this will populate the first cell)

My initial strategy was to build a multi-dimensional list and using a number of for-loops to populate them, but I am not sure whether there's an easier (or more recommended way) in Pandas.

Upvotes: 0

Views: 377

Answers (2)

Ian Thompson
Ian Thompson

Reputation: 3305

Since I don't have enough reputation to comment on your question just yet, I'm going to assume that you have typos in your data where the year is equal to 2099.

I would also like to ask how in your 2016-06-30 row there are 4 'Product No' that somehow existed in the first few columns (i.e. 2016-01-31 to 2016-05-31).

If those are errors then here is my solution:

First, make the data:

# Make dataframe
df = pd.DataFrame({'Product No' : [i for i in range(1,11)],
                   'Opening Date' : ['2016-01-01']*2 +\
                                    ['2016-02-01']*3 +\
                                    ['2016-01-01'] +\
                                    ['2016-06-01']*4,
                   'Closing Date' : ['2016-06-30', '2016-04-30', '2016-06-30', '2016-05-31'] +\
                                    ['2016-12-31']*2 +\
                                    ['2016-07-31', '2016-11-30', '2016-07-31', '2016-12-31'],
                   'Opening Month' : ['2016-01-31']*2 +\
                                     ['2016-02-29']*3 +\
                                     ['2016-01-31'] +\
                                     ['2016-06-30']*4,
                   'Closing Month' : ['2016-06-30', '2016-04-30', '2016-06-30', '2016-05-31',
                                      '2016-12-31', '2016-10-31', '2016-07-31', '2016-11-30',
                                      '2016-07-31', '2016-12-31']})

# Reorder columns
df = df.loc[:, ['Product No', 'Opening Date', 'Closing Date',
                'Opening Month', 'Closing Month']]

# Convert dates to datetime
for i in df.columns[1:]:
    df.loc[:, i] = pd.to_datetime(df.loc[:, i])

Second, I created a 'daterange' dataframe for holding the min to max dates of the original data set. I also included a 'Product No' column so that each Product would have a row on the table:

# Create date range dataframe
daterange = pd.DataFrame({'daterange' : pd.date_range(start = df.loc[:, 'Opening Month'].min(),
                                                  end = df.loc[:, 'Closing Month'].max(),
                                                  freq = 'M'),
                          'Product No' : [1]*12})

# Create 10 multiples of the daterange and concatenate
daterange10 = pd.concat([daterange]*10)

# Find the cumulative sum of the 'Product No' for daterange10
daterange10.loc[:, 'Product No'] = daterange10.groupby('daterange').cumsum()

Third, I merge the daterange and original df together and limit rows to only include when a 'Product No' existed. Also note that I have it so the closed dates must be greater than or equal to the daterange since (in my opinion) if the product closed on the last day of the month, then it existed during that whole month:

# Merge df with daterange10
df = df.merge(daterange10,
              how = 'inner',
              on = 'Product No')

# Limit rows to when 'Opening Month' is <= 'daterange' and 'Closing Month' is >= 'daterange'
df = df[(df.loc[:, 'Opening Month'] <= df.loc[:, 'daterange']) &
        (df.loc[:, 'Closing Month'] >= df.loc[:, 'daterange'])]

Last, I make a pivot table with the date values. Note that it only includes dates on the vertical axis that existed in the first place:

# Pivot on 'Opening Month', 'daterange'; count unique 'Product No'; fill NA with 0
df.pivot_table(index = 'Opening Month',
               columns = 'daterange',
               values = 'Product No',
               aggfunc = pd.Series.nunique).fillna(0)

pivot_table

Upvotes: 1

Dan
Dan

Reputation: 671

Try putting your data into a pandas DataFrame, then using an iterative approach to build the product survival DataFrame:

import pandas as pd

mydata = pd.read_excel('mysourcedata.xlsx')

def product_survival(sourcedf, startdate, enddate):

    df = pd.DataFrame()

    daterange = pd.date_range(startdate, enddate, freq='M')

    for i in daterange: # Rows
        for j in daterange: # Columns
            mycount = sourcedf[(sourcedf['Opening Month'] == i) & (sourcedf['Closing Month'] > j)]['Product No'].count()
            df.loc[i, j] = mycount

    return df

print(product_survival(mydata, '2016-01-31', '2016-12-31'))

Upvotes: 0

Related Questions