Reputation: 18810
I have read a csv file and pivoted it to get to following structure:
pivoted = df.pivot('user_id', 'group', 'value')
lookup = df.drop_duplicates('user_id')[['user_id', 'group']]
lookup.set_index(['user_id'], inplace=True)
result = pivoted.join(lookup)
result = result.fillna(0)
Section of the result:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 group
user_id
2 33653 2325 916 720 867 187 31 0 6 3 42 56 92 15 l-1
4 18895 414 1116 570 1190 55 92 0 122 23 78 6 4 2 l-2
16 1383 70 27 17 17 1 0 0 0 0 1 0 0 0 l-2
50 396 72 34 5 18 0 0 0 0 0 0 0 0 0 l-3
51 3915 1170 402 832 2791 316 12 5 118 51 32 9 62 27 l-4
I want to sum across column 0 to column 13 by each row and divide each cell by the sum of that row. I am still getting used to pandas; if I understand correctly, we should try to avoid for loops when doing things like this? In other words, how can I do this in a 'pandas' way?
Upvotes: 78
Views: 103851
Reputation: 10213
The most popular answer is what you should use BUT I'll toss a NumPy-based solution into the mix since this will also work if you have NumPy matrixes, not just Pandas DataFrames.
np.divide(df, df.sum(axis=1).values.reshape(df.shape[0],1))
np.divide()
divides 1 matrix by another. In other words, both need to be matrixesdf.sum(axis=1)
returns a vector which is a problem for np.divide
.values.reshape(df.shape[0],1)
reshapes vector of row-sums into a nx1 matrixUpvotes: 0
Reputation: 1
result.iloc[:,:-1].div(result.iloc[:,:-1].sum(axis=1), axis=0)
result.iloc[:,:-1]
gets all rows and columns except last column
result.iloc[:,:-1].sum(axis=1)
sums across a row due to axis=1, default is axis=0
i.e. column
result.div(result, axis=0)
axis=0
because default for div is column i.e. axis=1
Upvotes: 0
Reputation: 4483
easier to work per column:
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
(df.T / df.T.sum()).T
result:
0 1 2
0 0.166667 0.333333 0.500
1 0.266667 0.333333 0.400
2 0.291667 0.333333 0.375
Upvotes: 13
Reputation: 14169
Try the following:
In [1]: import pandas as pd
In [2]: df = pd.read_csv("test.csv")
In [3]: df
Out[3]:
id value1 value2 value3
0 A 1 2 3
1 B 4 5 6
2 C 7 8 9
In [4]: df["sum"] = df.sum(axis=1)
In [5]: df
Out[5]:
id value1 value2 value3 sum
0 A 1 2 3 6
1 B 4 5 6 15
2 C 7 8 9 24
In [6]: df_new = df.loc[:,"value1":"value3"].div(df["sum"], axis=0)
In [7]: df_new
Out[7]:
value1 value2 value3
0 0.166667 0.333333 0.500
1 0.266667 0.333333 0.400
2 0.291667 0.333333 0.375
Or you can do the following:
In [8]: df.loc[:,"value1":"value3"] = df.loc[:,"value1":"value3"].div(df["sum"], axis=0)
In [9]: df
Out[9]:
id value1 value2 value3 sum
0 A 0.166667 0.333333 0.500 6
1 B 0.266667 0.333333 0.400 15
2 C 0.291667 0.333333 0.375 24
Or just straight up from the beginning:
In [10]: df = pd.read_csv("test.csv")
In [11]: df
Out[11]:
id value1 value2 value3
0 A 1 2 3
1 B 4 5 6
2 C 7 8 9
In [12]: df.loc[:,"value1":"value3"] = df.loc[:,"value1":"value3"].div(df.sum(axis=1), axis=0)
In [13]: df
Out[13]:
id value1 value2 value3
0 A 0.166667 0.333333 0.500
1 B 0.266667 0.333333 0.400
2 C 0.291667 0.333333 0.375
Changing the column value1
and the like to your headers should work similarly.
Upvotes: 63
Reputation: 394051
The following seemed to work fine for me:
In [39]:
cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
result[cols] = result[cols].apply(lambda row: row / row.sum(axis=1), axis=1)
result
Out[39]:
0 1 2 3 4 5 6 \
user_id
2 0.864827 0.059749 0.023540 0.018503 0.022280 0.004806 0.000797
4 0.837285 0.018345 0.049453 0.025258 0.052732 0.002437 0.004077
16 0.912269 0.046174 0.017810 0.011214 0.011214 0.000660 0.000000
50 0.754286 0.137143 0.064762 0.009524 0.034286 0.000000 0.000000
51 0.401868 0.120099 0.041265 0.085403 0.286491 0.032437 0.001232
7 8 9 10 11 12 13 \
user_id
2 0.000000 0.000154 0.000077 0.001079 0.001439 0.002364 0.000385
4 0.000000 0.005406 0.001019 0.003456 0.000266 0.000177 0.000089
16 0.000000 0.000000 0.000000 0.000660 0.000000 0.000000 0.000000
50 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
51 0.000513 0.012113 0.005235 0.003285 0.000924 0.006364 0.002772
group
user_id
2 l-1
4 l-2
16 l-2
50 l-3
51 l-4
OK scratch the above, the following will be much faster:
result[cols] = result[cols].div(result[cols].sum(axis=1), axis=0)
And just to prove the result is the same:
In [47]:
cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
np.alltrue(result[cols].div(result[cols].sum(axis=1), axis=0) == result[cols].apply(lambda row: row / row.sum(axis=1), axis=1))
Out[47]:
True
And that it's faster:
In [48]:
cols = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13']
%timeit result[cols].div(result[cols].sum(axis=1), axis=0)
%timeit result[cols].apply(lambda row: row / row.sum(axis=1), axis=1)
100 loops, best of 3: 2.38 ms per loop
100 loops, best of 3: 4.47 ms per loop
Upvotes: 7