CIHAnalytics
CIHAnalytics

Reputation: 163

How can I add rows to a dataframe with zero value?

I have a dataframe with product sales to different accounts. Some of our products we haven't sold to those accounts, but I need it included in the dataframe with a 0 value. Here is what I have:

account degree_type discipline dollars
Harvard 4 Year Accounting 1000
Yale 4 Year Biology 2000
Ivy Tech 2 Year Biology 500
Harvard 4 Year Precalculus 3000

Let's assume we've never sold Biology to Harvard. I need a dataframe that looks like this which includes Harvard Biology at 0:

account degree_type discipline dollars
Harvard 4 Year Accounting 1000
Yale 4 Year Biology 2000
Ivy Tech 2 Year Biology 500
Harvard 4 Year Precalculus 3000
Harvard 4 Year Biology 0

The actual dataset is much bigger, of course, and contains about 3,000 schools and 149 different disciplines. So I need to find all the 0's and then append the values to the dataframe.

Anyone have any thoughts on how I could create these rows without writing over the sales I do have?

Upvotes: 0

Views: 697

Answers (2)

Rodalm
Rodalm

Reputation: 5433

If I understood it correctly, you want to consider all the distinct combinations resulting from the cartesian product of the columns 'account', 'degree_type' and 'discipline'.

In that case, you can do

import pandas as pd

df = pd.DataFrame({'account': {0: 'Harvard', 1: 'Yale', 2: 'Ivy Tech', 3: 'Harvard'},
 'degree_type': {0: '4 Year', 1: '4 Year', 2: '2 Year', 3: '4 Year'},
 'discipline': {0: 'Accounting', 1: 'Biology', 2: 'Biology', 3: 'Precalculus'},
 'dollars': {0: 1000, 1: 2000, 2: 500, 3: 3000}})

# get the unique values of each column 
idx_levels = {col: df[col].unique() for col in ['account', 'degree_type', 'discipline']}

# construct a new index from the cartesian product of those columns 
new_idx = pd.MultiIndex.from_product(idx_levels.values(), names=idx_levels.keys())

# set those columns as the index and reindex the DataFrame to include all possible combinations
# set the 'dollars' of combinations not present in the original DataFrame to zero 
res = df.set_index(list(idx_levels)).reindex(new_idx, fill_value=0).reset_index()
>>> df

    account degree_type   discipline  dollars
0   Harvard      4 Year   Accounting     1000
1      Yale      4 Year      Biology     2000
2  Ivy Tech      2 Year      Biology      500
3   Harvard      4 Year  Precalculus     3000

>>> res 

     account degree_type   discipline  dollars
0    Harvard      4 Year   Accounting     1000
1    Harvard      4 Year      Biology        0
2    Harvard      4 Year  Precalculus     3000
3    Harvard      2 Year   Accounting        0
4    Harvard      2 Year      Biology        0
5    Harvard      2 Year  Precalculus        0
6       Yale      4 Year   Accounting        0
7       Yale      4 Year      Biology     2000
8       Yale      4 Year  Precalculus        0
9       Yale      2 Year   Accounting        0
10      Yale      2 Year      Biology        0
11      Yale      2 Year  Precalculus        0
12  Ivy Tech      4 Year   Accounting        0
13  Ivy Tech      4 Year      Biology        0
14  Ivy Tech      4 Year  Precalculus        0
15  Ivy Tech      2 Year   Accounting        0
16  Ivy Tech      2 Year      Biology      500
17  Ivy Tech      2 Year  Precalculus        0

If the DataFrame contains duplicates...

Note that the above solution doesn't work if the DataFrame contains multiple entries with the same ('account', 'degree_type', 'discipline'). reindex can't handle duplicate indices.

In that case, we can take a different route and use merge instead.

df = pd.DataFrame({'account': {0: 'Harvard', 1: 'Yale', 2: 'Ivy Tech', 3: 'Harvard'},
 'degree_type': {0: '4 Year', 1: '4 Year', 2: '2 Year', 3: '4 Year'},
 'discipline': {0: 'Accounting', 1: 'Biology', 2: 'Biology', 3: 'Precalculus'},
 'dollars': {0: 1000, 1: 2000, 2: 500, 3: 3000}})

# duplicate the DataFrame to simulate non-unique ('account', 'degree_type', 'discipline') combinations
df = pd.concat([df, df], ignore_index=True)

idx_levels = {col: df[col].unique() for col in ['account', 'degree_type', 'discipline']}

new_idx = pd.MultiIndex.from_product(idx_levels.values(), names=idx_levels.keys())

res = ( 
    df.merge(new_idx.to_series(name="idx"), 
             how='right', 
             on=list(idx_levels))
    .fillna(0)
    .drop(columns='idx')
)

>>> df

    account degree_type   discipline  dollars
0   Harvard      4 Year   Accounting     1000
1      Yale      4 Year      Biology     2000
2  Ivy Tech      2 Year      Biology      500
3   Harvard      4 Year  Precalculus     3000
4   Harvard      4 Year   Accounting     1000
5      Yale      4 Year      Biology     2000
6  Ivy Tech      2 Year      Biology      500
7   Harvard      4 Year  Precalculus     3000

# Note that the repeated entries are present in the result 
>>> res

     account degree_type   discipline  dollars
0    Harvard      4 Year   Accounting   1000.0
1    Harvard      4 Year   Accounting   1000.0
2    Harvard      4 Year      Biology      0.0
3    Harvard      4 Year  Precalculus   3000.0
4    Harvard      4 Year  Precalculus   3000.0
5    Harvard      2 Year   Accounting      0.0
6    Harvard      2 Year      Biology      0.0
7    Harvard      2 Year  Precalculus      0.0
8       Yale      4 Year   Accounting      0.0
9       Yale      4 Year      Biology   2000.0
10      Yale      4 Year      Biology   2000.0
11      Yale      4 Year  Precalculus      0.0
12      Yale      2 Year   Accounting      0.0
13      Yale      2 Year      Biology      0.0
14      Yale      2 Year  Precalculus      0.0
15  Ivy Tech      4 Year   Accounting      0.0
16  Ivy Tech      4 Year      Biology      0.0
17  Ivy Tech      4 Year  Precalculus      0.0
18  Ivy Tech      2 Year   Accounting      0.0
19  Ivy Tech      2 Year      Biology    500.0
20  Ivy Tech      2 Year      Biology    500.0
21  Ivy Tech      2 Year  Precalculus      0.0

Upvotes: 1

BENY
BENY

Reputation: 323226

You can append

out = df.append(pd.Series(['Harvard','4 Year','Biology',0],index=df.columns),ignore_index=True)
Out[49]: 
    account degree_type   discipline  dollars
0   Harvard      4 Year   Accounting     1000
1      Yale      4 Year      Biology     2000
2  Ivy Tech      2 Year      Biology      500
3   Harvard      4 Year  Precalculus     3000
4   Harvard      4 Year      Biology        0

Upvotes: 1

Related Questions