Reputation: 459
What is the most straightforward method to replace values in some columns by their percentage in a row?
Example:
from this
to this
I have tried the code:
cols=['h1', 'h2', 'h3', 'hn']
df[cols]=df[cols]/df['sum']
but this returns the error:
ValueError: Columns must be same length as key.
In addition, i do not think this is the best way as i may have many more columns than 4.
Upvotes: 2
Views: 1847
Reputation: 863031
Use DataFrame.div
with specify axis=0
:
cols=['h1', 'h2', 'h3', 'hn']
df[cols]=df[cols].div(df['sum'], axis=0)
If sum
is last column is possible use:
df.iloc[:, :-1]=df.iloc[:, :-1].div(df['sum'], axis=0)
Sample:
df = pd.DataFrame({
'h1':[4,5,4,5,5,4],
'h2':[7,8,9,4,2,3],
'h3':[1,3,5,7,1,0],
'hn':[4,5,4,5,5,4],
})
df['sum'] = df.sum(axis=1)
df.iloc[:, :-1] = df.iloc[:, :-1].div(df['sum'], axis=0)
print (df)
h1 h2 h3 hn sum
0 0.250000 0.437500 0.062500 0.250000 16
1 0.238095 0.380952 0.142857 0.238095 21
2 0.181818 0.409091 0.227273 0.181818 22
3 0.238095 0.190476 0.333333 0.238095 21
4 0.384615 0.153846 0.076923 0.384615 13
5 0.363636 0.272727 0.000000 0.363636 11
Performance:
np.random.seed(2019)
N = 10000
df = pd.DataFrame(np.random.randint(100, size=(N, 20))).add_prefix('h')
df['sum'] = df.sum(axis=1)
print (df)
In [220]: %%timeit
...: df.iloc[:, :-1]=df.iloc[:, :-1].div(df['sum'], axis=0)
...:
8.03 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
C:\Anaconda\lib\site-packages\spyder\widgets\variableexplorer\utils.py:410: FutureWarning: 'summary' is deprecated and will be removed in a future version.
display = value.summary()
In [221]: %%timeit
...: for col in df.columns[:-1]:
...: df[col] /= df["sum"]
...:
9.46 ms ± 168 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
C:\Anaconda\lib\site-packages\spyder\widgets\variableexplorer\utils.py:410: FutureWarning: 'summary' is deprecated and will be removed in a future version.
display = value.summary()
In [222]: %%timeit
...: df.iloc[:,:-1] = df.iloc[:,:-1].apply(lambda x: x/sum(x), axis=1)
...:
2.51 s ± 194 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
C:\Anaconda\lib\site-packages\spyder\widgets\variableexplorer\utils.py:410: FutureWarning: 'summary' is deprecated and will be removed in a future version.
display = value.summary()
Upvotes: 3
Reputation: 8923
Another solution:
import pandas as pd
df = pd.DataFrame({
'h1':[56,4,4,203],
'h2':[89,67,3,4],
'h3':[45,23,6,46],
'hn':[32,44,78,78],
'sum':[222,138,91,331],
})
df.iloc[:,:-1] = df.iloc[:,:-1].apply(lambda x: x/sum(x), axis=1)
And you get:
h1 h2 h3 hn sum
0 0.252252 0.400901 0.202703 0.144144 222
1 0.028986 0.485507 0.166667 0.318841 138
2 0.043956 0.032967 0.065934 0.857143 91
3 0.613293 0.012085 0.138973 0.235650 331
Note this solution is much slower than jezrael's one.
%timeit df.iloc[:,:-1] = df.iloc[:,:-1].apply(lambda x: x/sum(x), axis=1)
%timeit df.iloc[:, :-1]=df.iloc[:, :-1].div(df['sum'], axis=0)
On a normal laptop it gives me:
3.33 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.27 ms ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Almost three times slower.
Upvotes: 0
Reputation: 765
Maybe slightly more readable:
for col in df.columns[:-1]:
df[col] /= df["sum"]
Upvotes: 2