Reputation: 3754
I'm using crosstab
to sum the sales in given areas by the Publisher.
The original dataframe looks like this:
Publisher NA_Sales EU_Sales JP_Sales
1 Nintendo 29.08 3.58 6.81
2 Nintendo 15.68 12.76 3.79
3 Nintendo 15.61 10.93 3.28
4 Nintendo 11.27 8.89 10.22
5 Nintendo 23.20 2.26 4.22
I did it with pivot table now I want to do it using crosstab.
salespivot1=pd.pivot_table(df, index=df.Publisher,
aggfunc=np.sum).sort_values('NA_Sales', ascending=False)
creates:
EU_Sales JP_Sales NA_Sales
Publisher
Nintendo 390.05 454.38 775.61
Electronic Arts 373.91 14.35 599.50
Activision 215.90 6.71 432.59
Sony Computer Entertainment 186.56 74.15 266.17
Ubisoft 161.99 7.52 252.74
But using crosstab I cant recreate this dataframe, because it stacks EU_Sales on top of the NA_Sales no matter what I do
salespivot3=pd.crosstab(index=df.Publisher, columns=['NA_Sales', 'EU_Sales'],
values=df.NA_Sales, aggfunc=sum)
creates:
col_0 NA_Sales
col_1 EU_Sales
Nintendo 775.61
Electronic Arts 599.50
Activision 432.59
Sony Computer Entertainment 266.17
Ubisoft 252.74
How can I recreate the dataframe with crosstab to give same results as pivot?
Upvotes: 2
Views: 10606
Reputation: 210852
I think your are "misusing" pivot_table
and crosstab
methods.
Pivot methods suppose to transform your data set from long format to wide format.
Here is a small demo:
Source DF:
In [42]: df
Out[42]:
A B C D
0 foo one small 1
1 foo one large 2
2 foo one large 2
3 foo two small 3
4 foo two small 3
5 bar one large 4
6 bar one small 5
7 bar two small 6
8 bar two large 7
pivot_table
usage:
In [43]: df.pivot_table(values='D', index=['A', 'B'], columns=['C'], aggfunc='sum')
Out[43]:
C large small
A B
bar one 4.0 5.0
two 7.0 6.0
foo one 4.0 1.0
two NaN 6.0
now the same result using pd.crosstab()
method:
In [44]: pd.crosstab(index=[df.A,df.B], columns=df.C, values=df.D, aggfunc='sum')
Out[44]:
C large small
A B
bar one 4.0 5.0
two 7.0 6.0
foo one 4.0 1.0
two NaN 6.0
for your sample data set you simply want to groupby
+ sum
:
In [46]: df
Out[46]:
Publisher NA_Sales EU_Sales JP_Sales
1 Nintendo 29.08 3.58 6.81
2 Nintendo 15.68 12.76 3.79
3 Nintendo 15.61 10.93 3.28
4 Nintendo 11.27 8.89 10.22
5 Nintendo 23.20 2.26 4.22
In [47]: df.groupby('Publisher', as_index=False).sum()
Out[47]:
Publisher NA_Sales EU_Sales JP_Sales
0 Nintendo 94.84 38.42 28.32
UPDATE: if you by all means want to generate the same data set using crosstab
method you can do it this way:
In [63]: x = df.set_index('Publisher').stack().reset_index(name='val')
In [64]: x
Out[64]:
Publisher level_1 val
0 Nintendo NA_Sales 29.08
1 Nintendo EU_Sales 3.58
2 Nintendo JP_Sales 6.81
3 Nintendo NA_Sales 15.68
4 Nintendo EU_Sales 12.76
5 Nintendo JP_Sales 3.79
6 Nintendo NA_Sales 15.61
7 Nintendo EU_Sales 10.93
8 Nintendo JP_Sales 3.28
9 Nintendo NA_Sales 11.27
10 Nintendo EU_Sales 8.89
11 Nintendo JP_Sales 10.22
12 Nintendo NA_Sales 23.20
13 Nintendo EU_Sales 2.26
14 Nintendo JP_Sales 4.22
In [65]: pd.crosstab(index=x.Publisher, columns=x.level_1, values=x.val, aggfunc='sum')
Out[65]:
level_1 EU_Sales JP_Sales NA_Sales
Publisher
Nintendo 38.42 28.32 94.84
NOTE: we had first to convert your original DF from wide to long format and then back to wide format using pd.crosstab
Upvotes: 3
Reputation: 29711
It's not possible to use pd.crosstab()
directly on your current DF
unless you reshape them from wide to a long format so that the resulting headers would later serve as subsequent parameters to be passed into it's function call.
Here's a slight hack:
idx = ["Publisher"]
d = pd.melt(df, id_vars=idx)
pd.crosstab(d.Publisher, d.variable, d.value, aggfunc="sum", rownames=idx, colnames=[None])
But honestly, you should be using either a groupby
/pivot_table
approach which is designed for this exact purpose.
Upvotes: 8