Reputation: 2061
Could someone help me to achieve this task? I have data in multi-level data frame through the unstack() operation:
Original df:
Density Length Range Count
15k 0.60 small 555
15k 0.60 big 17
15k 1.80 small 141
15k 1.80 big 21
15k 3.60 small 150
15k 3.60 big 26
20k 0.60 small 5543
20k 0.60 big 22
20k 1.80 small 553
20k 1.80 big 25
20k 3.60 small 422
20k 3.60 big 35
df = df.set_index(['Density','Length','Range']).unstack('Range')
# After unstack:
Count
Range big small
Density Length
15k 0.60 17 555
1.80 21 141
3.60 26 150
20k 0.60 22 5543
1.80 25 553
3.60 35 422
Now I try to add an extra column in level 1. it is the ratio of the small/big. I have tried the following syntax, no error but with different outcomes
#df[:]['ratio']=df['Count']['small']/df['Count']['big'] ## case 1. no error, no ratio
#df['Count']['ratio']=df['Count']['small']/df['Count']['big'] ## case 2. no error, no ratio
#df['ratio']=df['Count']['small']/df['Count']['big'] ## case 3. no error, ratio on column level 0
df['ratio']=df.ix[:,1]/df.ix[:,0] ## case 4. no error, ratio on column level 0
#After execution above code, df:
Count ratio
Range big small
Density Length
15k 0.60 17 555 32.65
1.80 21 141 6.71
3.60 26 150 5.77
20k 0.60 22 5543 251.95
1.80 25 553 22.12
3.60 35 422 12.06
I don't understand why case 1 & 2 show no error neither adding new ratio column. and why in case 3 & 4 the ratio column is on level 0, not the expected level 1. Also like to know if there is a better/concise way to achieve this. Case 4 is the best I can do but I don't like the implicit indexing way (instead of using the name) to refer to a column.
Thanks
Upvotes: 3
Views: 4729
Reputation: 880907
Case 1:
df[:]['ratio']=df['Count']['small']/df['Count']['big']
df[:]
is a copy of df
. They are different objects, each with its own copy of the underlying data:
In [69]: df[:] is df
Out[69]: False
So modifying the copy has no effect on the original df. Since no reference is
maintained for df[:]
, the object is garbage collected after the assignment,
making the assignment useless.
Case 2:
df['Count']['ratio']=df['Count']['small']/df['Count']['big']
uses chain-indexing. Avoid chain indexing when making assignments. The link explains why assignments using chain-indexing on the left-hand side may not affect df
.
If you set
pd.options.mode.chained_assignment = 'warn'
then Pandas will warn you not to use chain-indexing in assignments:
SettingWithCopyError:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Case 3:
df['ratio']=df['Count']['small']/df['Count']['big']
and Case 4
df['ratio']=df.ix[:,1]/df.ix[:,0]
both work, but it could done more efficently using
df['ratio'] = df['Count','small']/df['Count','big']
Here is a microbenchmark showing that using df[tuple_index]
is faster than
chain-indexing:
In [99]: %timeit df['Count']['small']
1000 loops, best of 3: 501 µs per loop
In [128]: %timeit df['Count','small']
100000 loops, best of 3: 8.91 µs per loop
If you want ratio
to be the level 1 label, then you must tell Pandas that the level 0 label is Count
. You can do that by assigning to df['Count','ratio']
:
In [96]: df['Count','ratio'] = df['Count']['small']/df['Count','big']
# In [97]: df
# Out[97]:
# Count
# Range big small ratio
# Density Length
# 15k 0.6 17 555 32.647059
# 1.8 21 141 6.714286
# 3.6 26 150 5.769231
# 20k 0.6 22 5543 251.954545
# 1.8 25 553 22.120000
# 3.6 35 422 12.057143
Upvotes: 5