FredMaster
FredMaster

Reputation: 1469

Python Pandas: replace groupby operation

I have the following table as a pandas dataframe:

| ID | Name | Sales | Source   |
|----|------|-------|----------|
| 1  | a    | 34    | Source A |
| 2  | b    | 3423  | Source A |
| 3  | c    | 2     | Source A |
| 4  | d    | 342   | Source A |
| 3  | c    | 34    | Source A |
| 5  | e    | 234   | Source A |
| 6  | f    | 234   | Source A |
| 7  | g    | 23    | Source A |
| 1  | a    | 12    | Source B |
| 2  | b    | 42    | Source B |
| 3  | c    | 9     | Source B |
| 2  | b    | 22    | Source B |
| 1  | a    | 1     | Source B |
| 8  | h    | 56    | Source B |

What is the best way to (i) aggregate sales for each ID for each soure and (ii) put the result in two new columns "Source A" and "Source B" such that the resulting dataframe looks as follows:

| ID | Name | Source A | Source B |
|----|------|----------|----------|
| 1  | a    | 34       | 13       |
| 2  | b    | 3423     | 64       |
| 3  | c    | 36       | 9        |
| 4  | d    | 342      | 0        |
| 5  | e    | 234      | 0        |
| 6  | f    | 234      | 0        |
| 7  | g    | 23       | 0        |
| 8  | h    | 0        | 56       |

My initial approach was as follows:

data = {"ID":[1,2,3,4,3,5,6,7,1,2,3,2,1,8], 
      "Name":list("abcdcefgabcbah"), 
      "Sales":[34,3423,2,342,34,234,234,23,12,42,9,22,1,56],
      "Source":["Source A"]*8 + ["Source B"]*6
     }
df = pd.DataFrame(data)

df.groupby(["ID","Name","Source"])["Sales"].sum().unstack()

Question: my initial table is build using different files and than applying pd.concat. So it feels like I could achieve the final table by concatenating (or merging) differently in the first place. Is there a better approach to achieve this? As a side node: the actual data table consists out of 6 different sources.

Thanks for your help!

Upvotes: 1

Views: 295

Answers (2)

8one6
8one6

Reputation: 13768

Try the following code:

df.groupby(['Name', 'Source'])['Sales'].sum()\
    .unstack(1).fillna(0).reset_index()

Upvotes: 1

Scott Boston
Scott Boston

Reputation: 153500

You can use pd.crosstab:

pd.crosstab(df.Name, df.Source, df.Sales, aggfunc='sum').fillna(0)

Output:

Source  Source A  Source B
Name                      
a           34.0      13.0
b         3423.0      64.0
c           36.0       9.0
d          342.0       0.0
e          234.0       0.0
f          234.0       0.0
g           23.0       0.0
h            0.0      56.0

Or, pivot_table

df.pivot_table('Sales','Name','Source', aggfunc='sum').fillna(0)

Output:

Source  Source A  Source B
Name                      
a           34.0      13.0
b         3423.0      64.0
c           36.0       9.0
d          342.0       0.0
e          234.0       0.0
f          234.0       0.0
g           23.0       0.0
h            0.0      56.0

Or using set_index and sum with level parameter then unstack:

df.set_index(['Name','Source'])['Sales'].sum(level=[0,1]).unstack(fill_value=0) 

Output:

Source  Source A  Source B
Name                      
a             34        13
b           3423        64
c             36         9
d            342         0
e            234         0
f            234         0
g             23         0
h              0        56

Upvotes: 3

Related Questions