Reputation: 203
I need to normalize the rows of a dataframe in which some rows contain only zeros. For example:
import pandas as pd
df = pd.DataFrame({"ID": ['1', '2', '3', '4'], "A": [1, 0, 10, 0], "B": [4, 0, 30, 0]})
ID A B
1 1 4
2 0 0
3 10 30
4 0 0
My approach is to first exclude the zero-value rows followed by normalizing the non-zero subset using:
df1 = df[df.sum(axis=1) != 0]
df2 = df[df.sum(axis=1) == 0]
sum_row = df1.sum(axis=1)
df1.div(sum_row, axis=0)
and then concatenate the two dataframes as follows:
pd.concat([df1, df2]).reset_index()
However, I get the following error when applying df1.div(sum_row, axis=0):
ValueError: operands could not be broadcast together with shapes (6,) (2,)
I wonder how to fix the error and if there exists a more efficient approach. Thanks!
Edit: The resulting dataframe is expected to look like this:
ID A B
1 0.2 0.8
2 0 0
3 0.25 0.75
4 0 0
Upvotes: 5
Views: 14170
Reputation: 2684
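Use div after moving ID into the index, so that only the numeric columns are summed and divided. All-zero rows produce 0/0 = NaN, which fillna(0) turns back into 0:
df = pd.DataFrame({"ID": ['1', '2', '3', '4'], "A": [1, 0, 10, 0], "B": [4, 0, 30, 0]})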
df.set_index("ID", inplace=True)
df.div(df.sum(axis=1), axis=0).fillna(0)
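If ID should stay a regular column (as in the expected output in the question), the same idea can be sketched without set_index. This is only a minimal sketch: the names value_cols, row_sums, safe_sums and result are made up for illustration, and replacing zero sums by 1 assumes the values are non-negative, so a row sum of zero really means an all-zero row.

```python
import pandas as pd

df = pd.DataFrame({"ID": ["1", "2", "3", "4"], "A": [1, 0, 10, 0], "B": [4, 0, 30, 0]})

# Columns to normalize (assumption: everything except the identifier).
value_cols = ["A", "B"]

# Row sums over the numeric columns only, so the string ID never enters the arithmetic.
row_sums = df[value_cols].sum(axis=1)

# Swap zero sums for 1 so all-zero rows divide to 0 instead of NaN.
# Assumes non-negative values; otherwise a zero sum need not mean an all-zero row.
safe_sums = row_sums.where(row_sums != 0, 1)

result = df.copy()
result[value_cols] = df[value_cols].div(safe_sums, axis=0)
print(result)
```

This avoids both the split into df1/df2 and the concat step, and the row order is preserved.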
Upvotes: 4
Reputation: 323386
Using melt with crosstab:
newdf=df.melt('ID')
pd.crosstab(index=newdf.ID,columns=newdf.variable,values=newdf.value,normalize='index',aggfunc='mean')
Out[447]:
variable A B
ID
1 0.20 0.80
2 0.00 0.00
3 0.25 0.75
4 0.00 0.00
Upvotes: 1
Reputation: 36619
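You can use Normalizer from scikit-learn with the L1 norm, which divides each row by the sum of its absolute values and leaves all-zero rows unchanged:
from sklearn.preprocessing import Normalizer
df = pd.DataFrame({"ID": ['1', '2', '3', '4'], "A": [1, 0, 10, 0], "B": [4, 0, 30, 0]})
df = df.set_index('ID')
# Build a new float DataFrame instead of writing floats into the integer columns in place
df = pd.DataFrame(Normalizer(norm='l1').fit_transform(df), index=df.index, columns=df.columns)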
print(df)
A B
ID
1 0.20 0.80
2 0.00 0.00
3 0.25 0.75
4 0.00 0.00
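For reference, here is a rough NumPy sketch of what L1 row normalization amounts to, in case adding scikit-learn as a dependency is not desirable. The variable names (values, l1_norms, normalized) are illustrative, and this is meant as an equivalent hand-written computation, not as scikit-learn's actual implementation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"ID": ["1", "2", "3", "4"], "A": [1, 0, 10, 0], "B": [4, 0, 30, 0]}
).set_index("ID")

values = df.to_numpy(dtype=float)

# L1 norm of each row: sum of absolute values, kept as a column vector for broadcasting.
l1_norms = np.abs(values).sum(axis=1, keepdims=True)

# Rows with norm 0 are all-zero; dividing them by 1 leaves them as zeros.
l1_norms[l1_norms == 0] = 1.0

normalized = pd.DataFrame(values / l1_norms, index=df.index, columns=df.columns)
print(normalized)
```

Because the norm uses absolute values, a zero norm can only come from an all-zero row, so this also behaves sensibly when the data contains negative numbers.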
Upvotes: 7