Reputation: 12406
I am trying to normalize some columns of a Pandas DataFrame in Python to their sum. I have the following the DataFrame:
import pandas as pd
l_a_2015 = ['Farh','Rob_Sens','Pressure','Septic',10.0,45.,52.,72.51]
l_a_2010 = ['Water_Column','Log','Humid','Top_Tank',58.64,35.42,10.,30.]
df = pd.DataFrame([l_a_2010,l_a_2015],columns=['Output_A','Tonnes_Rem',
'Log_Act_All','Readout','A1','A2','A3','A4'])
I would like to normalize the columns A1
,A2
,A3
,A4
to their sum as shown here - divide each element on a row by the sum of 4 elements.
The first part of this appears to work fine - I get the sum of the last 4 columns, on each row, with this:
x,y = df.sum(axis=1).tolist()
So, the list [x,y]
gives me the sum of the first and second rows (last 4 columns). However, when I try to divide all DataFrame entries on each row by the sum of that row, then I am having problems:
for b,n in enumerate([x,y]):
for f,elem in enumerate(list(df)[4:]):
df.iloc[b,f] = (df.iloc[b,f]/n)*100.
I get the following error:
[Traceback (most recent call last):134.06, 179.50999999999999]
File "C:\test.py", line 13, in <module>
df.iloc[b,f] = (df.iloc[b,f]/n)*100.
TypeError: ufunc 'divide' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
When I use print df.dtypes
I am getting float64
for all the columns so I am not sure why the division is not safe.
Is there
Upvotes: 1
Views: 407
Reputation: 210882
try this:
In [5]: df
Out[5]:
Output_A Tonnes_Rem Log_Act_All Readout A1 A2 A3 A4
0 Water_Column Log Humid Top_Tank 58.64 35.42 10.0 30.00
1 Farh Rob_Sens Pressure Septic 10.00 45.00 52.0 72.51
In [8]: cols = df.select_dtypes(include=['number']).columns.tolist()
In [9]: cols
Out[9]: ['A1', 'A2', 'A3', 'A4']
let's create a view with numeric columns only:
In [10]: v = df[cols]
In [13]: df[cols] = v.div(v.sum(axis=1), 0)
In [14]: df
Out[14]:
Output_A Tonnes_Rem Log_Act_All Readout A1 A2 A3 A4
0 Water_Column Log Humid Top_Tank 0.437416 0.264210 0.074593 0.223780
1 Farh Rob_Sens Pressure Septic 0.055707 0.250682 0.289677 0.403933
an alternative way to select A*
columns:
In [18]: df.filter(regex='^A\d+')
Out[18]:
A1 A2 A3 A4
0 0.437416 0.264210 0.074593 0.223780
1 0.055707 0.250682 0.289677 0.403933
In [19]: df.filter(regex='^A\d+').columns
Out[19]: Index(['A1', 'A2', 'A3', 'A4'], dtype='object')
Upvotes: 2