Reputation: 1114
I have a DataFrame of dimension n x m. I would like to return a DataFrame of the same dimension in which each cell holds its value as a fraction of the total of its row.
For example:
df=sc.parallelize([
('a1',15,2,0,3),
('a2',3,9,5,3),
('a2',4,10,4,2),
('a1',0,10,7,3)
]).toDF(['id1','x1','x2','x3','x4'])
| id1| x1| x2| x3| x4|
| a1| 15| 2| 0| 3|
| a2| 3| 9| 5| 3|
| a2| 4| 10| 4| 2|
| a1| 0| 10| 7| 3|
I would like to return
| id1| x1 | x2 | x3 | x4 |
| a1 | .75| .1 | .0 | .15|
| a2 | .15| .45| .25| .15|
| a2 | .2 | .5 | .2 | .1 |
| a1 | .0 | .5 | .35| .15|
Upvotes: 1
Views: 1325
Reputation: 330083
It is pretty simple. Compute sum per row:
total = sum(df[c] for c in df.columns[1:])
and select:
df.select(df.columns[0], *[(df[c] / total).alias(c) for c in df.columns[1:]])
Upvotes: 2