Reputation: 1469
I have the following table as a pandas dataframe
:
| ID | Name | Sales | Source |
|----|------|-------|----------|
| 1 | a | 34 | Source A |
| 2 | b | 3423 | Source A |
| 3 | c | 2 | Source A |
| 4 | d | 342 | Source A |
| 3 | c | 34 | Source A |
| 5 | e | 234 | Source A |
| 6 | f | 234 | Source A |
| 7 | g | 23 | Source A |
| 1 | a | 12 | Source B |
| 2 | b | 42 | Source B |
| 3 | c | 9 | Source B |
| 2 | b | 22 | Source B |
| 1 | a | 1 | Source B |
| 8 | h | 56 | Source B |
What is the best way to (i) aggregate sales for each ID for each soure and (ii) put the result in two new columns "Source A" and "Source B" such that the resulting dataframe
looks as follows:
| ID | Name | Source A | Source B |
|----|------|----------|----------|
| 1 | a | 34 | 13 |
| 2 | b | 3423 | 64 |
| 3 | c | 36 | 9 |
| 4 | d | 342 | 0 |
| 5 | e | 234 | 0 |
| 6 | f | 234 | 0 |
| 7 | g | 23 | 0 |
| 8 | h | 0 | 56 |
My initial approach was as follows:
data = {"ID":[1,2,3,4,3,5,6,7,1,2,3,2,1,8],
"Name":list("abcdcefgabcbah"),
"Sales":[34,3423,2,342,34,234,234,23,12,42,9,22,1,56],
"Source":["Source A"]*8 + ["Source B"]*6
}
df = pd.DataFrame(data)
df.groupby(["ID","Name","Source"])["Sales"].sum().unstack()
Question: my initial table is build using different files and than applying pd.concat
. So it feels like I could achieve the final table by concatenating (or merging) differently in the first place. Is there a better approach to achieve this? As a side node: the actual data table consists out of 6 different sources.
Thanks for your help!
Upvotes: 1
Views: 295
Reputation: 13768
Try the following code:
df.groupby(['Name', 'Source'])['Sales'].sum()\
.unstack(1).fillna(0).reset_index()
Upvotes: 1
Reputation: 153500
pd.crosstab
:pd.crosstab(df.Name, df.Source, df.Sales, aggfunc='sum').fillna(0)
Output:
Source Source A Source B
Name
a 34.0 13.0
b 3423.0 64.0
c 36.0 9.0
d 342.0 0.0
e 234.0 0.0
f 234.0 0.0
g 23.0 0.0
h 0.0 56.0
df.pivot_table('Sales','Name','Source', aggfunc='sum').fillna(0)
Output:
Source Source A Source B
Name
a 34.0 13.0
b 3423.0 64.0
c 36.0 9.0
d 342.0 0.0
e 234.0 0.0
f 234.0 0.0
g 23.0 0.0
h 0.0 56.0
set_index
and sum
with level
parameter then unstack
:df.set_index(['Name','Source'])['Sales'].sum(level=[0,1]).unstack(fill_value=0)
Output:
Source Source A Source B
Name
a 34 13
b 3423 64
c 36 9
d 342 0
e 234 0
f 234 0
g 23 0
h 0 56
Upvotes: 3