Reputation: 163
I have a dataframe with product sales to different accounts. Some of our products we haven't sold to those accounts, but I need it included in the dataframe with a 0 value. Here is what I have:
account | degree_type | discipline | dollars |
---|---|---|---|
Harvard | 4 Year | Accounting | 1000 |
Yale | 4 Year | Biology | 2000 |
Ivy Tech | 2 Year | Biology | 500 |
Harvard | 4 Year | Precalculus | 3000 |
Let's assume we've never sold Biology to Harvard. I need a dataframe that looks like this which includes Harvard Biology at 0:
account | degree_type | discipline | dollars |
---|---|---|---|
Harvard | 4 Year | Accounting | 1000 |
Yale | 4 Year | Biology | 2000 |
Ivy Tech | 2 Year | Biology | 500 |
Harvard | 4 Year | Precalculus | 3000 |
Harvard | 4 Year | Biology | 0 |
The actual dataset is much bigger, of course, and contains about 3,000 schools and 149 different disciplines. So I need to find all the 0's and then append the values to the dataframe.
Anyone have any thoughts on how I could create these rows without writing over the sales I do have?
Upvotes: 0
Views: 697
Reputation: 5433
If I understood it correctly, you want to consider all the distinct combinations resulting from the cartesian product of the columns 'account', 'degree_type' and 'discipline'.
In that case, you can do
import pandas as pd
df = pd.DataFrame({'account': {0: 'Harvard', 1: 'Yale', 2: 'Ivy Tech', 3: 'Harvard'},
'degree_type': {0: '4 Year', 1: '4 Year', 2: '2 Year', 3: '4 Year'},
'discipline': {0: 'Accounting', 1: 'Biology', 2: 'Biology', 3: 'Precalculus'},
'dollars': {0: 1000, 1: 2000, 2: 500, 3: 3000}})
# get the unique values of each column
idx_levels = {col: df[col].unique() for col in ['account', 'degree_type', 'discipline']}
# construct a new index from the cartesian product of those columns
new_idx = pd.MultiIndex.from_product(idx_levels.values(), names=idx_levels.keys())
# set those columns as the index and reindex the DataFrame to include all possible combinations
# set the 'dollars' of combinations not present in the original DataFrame to zero
res = df.set_index(list(idx_levels)).reindex(new_idx, fill_value=0).reset_index()
>>> df
account degree_type discipline dollars
0 Harvard 4 Year Accounting 1000
1 Yale 4 Year Biology 2000
2 Ivy Tech 2 Year Biology 500
3 Harvard 4 Year Precalculus 3000
>>> res
account degree_type discipline dollars
0 Harvard 4 Year Accounting 1000
1 Harvard 4 Year Biology 0
2 Harvard 4 Year Precalculus 3000
3 Harvard 2 Year Accounting 0
4 Harvard 2 Year Biology 0
5 Harvard 2 Year Precalculus 0
6 Yale 4 Year Accounting 0
7 Yale 4 Year Biology 2000
8 Yale 4 Year Precalculus 0
9 Yale 2 Year Accounting 0
10 Yale 2 Year Biology 0
11 Yale 2 Year Precalculus 0
12 Ivy Tech 4 Year Accounting 0
13 Ivy Tech 4 Year Biology 0
14 Ivy Tech 4 Year Precalculus 0
15 Ivy Tech 2 Year Accounting 0
16 Ivy Tech 2 Year Biology 500
17 Ivy Tech 2 Year Precalculus 0
If the DataFrame contains duplicates...
Note that the above solution doesn't work if the DataFrame contains multiple entries with the same ('account', 'degree_type', 'discipline'). reindex
can't handle duplicate indices.
In that case, we can take a different route and use merge
instead.
df = pd.DataFrame({'account': {0: 'Harvard', 1: 'Yale', 2: 'Ivy Tech', 3: 'Harvard'},
'degree_type': {0: '4 Year', 1: '4 Year', 2: '2 Year', 3: '4 Year'},
'discipline': {0: 'Accounting', 1: 'Biology', 2: 'Biology', 3: 'Precalculus'},
'dollars': {0: 1000, 1: 2000, 2: 500, 3: 3000}})
# duplicate the DataFrame to simulate non-unique ('account', 'degree_type', 'discipline') combinations
df = pd.concat([df, df], ignore_index=True)
idx_levels = {col: df[col].unique() for col in ['account', 'degree_type', 'discipline']}
new_idx = pd.MultiIndex.from_product(idx_levels.values(), names=idx_levels.keys())
res = (
df.merge(new_idx.to_series(name="idx"),
how='right',
on=list(idx_levels))
.fillna(0)
.drop(columns='idx')
)
>>> df
account degree_type discipline dollars
0 Harvard 4 Year Accounting 1000
1 Yale 4 Year Biology 2000
2 Ivy Tech 2 Year Biology 500
3 Harvard 4 Year Precalculus 3000
4 Harvard 4 Year Accounting 1000
5 Yale 4 Year Biology 2000
6 Ivy Tech 2 Year Biology 500
7 Harvard 4 Year Precalculus 3000
# Note that the repeated entries are present in the result
>>> res
account degree_type discipline dollars
0 Harvard 4 Year Accounting 1000.0
1 Harvard 4 Year Accounting 1000.0
2 Harvard 4 Year Biology 0.0
3 Harvard 4 Year Precalculus 3000.0
4 Harvard 4 Year Precalculus 3000.0
5 Harvard 2 Year Accounting 0.0
6 Harvard 2 Year Biology 0.0
7 Harvard 2 Year Precalculus 0.0
8 Yale 4 Year Accounting 0.0
9 Yale 4 Year Biology 2000.0
10 Yale 4 Year Biology 2000.0
11 Yale 4 Year Precalculus 0.0
12 Yale 2 Year Accounting 0.0
13 Yale 2 Year Biology 0.0
14 Yale 2 Year Precalculus 0.0
15 Ivy Tech 4 Year Accounting 0.0
16 Ivy Tech 4 Year Biology 0.0
17 Ivy Tech 4 Year Precalculus 0.0
18 Ivy Tech 2 Year Accounting 0.0
19 Ivy Tech 2 Year Biology 500.0
20 Ivy Tech 2 Year Biology 500.0
21 Ivy Tech 2 Year Precalculus 0.0
Upvotes: 1
Reputation: 323226
You can append
out = df.append(pd.Series(['Harvard','4 Year','Biology',0],index=df.columns),ignore_index=True)
Out[49]:
account degree_type discipline dollars
0 Harvard 4 Year Accounting 1000
1 Yale 4 Year Biology 2000
2 Ivy Tech 2 Year Biology 500
3 Harvard 4 Year Precalculus 3000
4 Harvard 4 Year Biology 0
Upvotes: 1