Reputation: 491
I have a table named monthly_agg which contains monthly aggregated data.
+------------+-----+----------+-----------+---------------+--------------+-------------+----------+---------+
| yyyy_mm_dd | id  | app      | ex_status | active_status | active_count | active_base | ex_count | ex_base |
+------------+-----+----------+-----------+---------------+--------------+-------------+----------+---------+
| 2019-01-31 | 123 | content  | impl      | impl          | 390          | 321         | 344      | 340     |
+------------+-----+----------+-----------+---------------+--------------+-------------+----------+---------+
| 2019-01-31 | 333 | messages | impl      | impl          | 541          | 210         | 788      | 610     |
+------------+-----+----------+-----------+---------------+--------------+-------------+----------+---------+
| 2019-01-31 | 832 | photos   | no        | no            | null         | 430         | null     | 100     |
+------------+-----+----------+-----------+---------------+--------------+-------------+----------+---------+
I want to make each app be a column. Each app column should contain a percentage, which is calculated as follows:
df = spark.sql("""
    SELECT
        yyyy_mm_dd,
        id,
        app,
        SUM(CASE
                WHEN (app = 'content' AND ex_status = 'impl') THEN ex_count / ex_base
                WHEN (active_status = 'impl') THEN active_count / active_base
            END) AS percentage
    FROM
        monthly_agg
    GROUP BY
        yyyy_mm_dd, id, app
""")
I need each app value to become a column, with the value of that column being the result of the above calculation. How could I pivot the table this way using Pandas instead of HQL? Ideally my output df would look like this:
+------------+-----+--------------------+---------------------+
| yyyy_mm_dd | id  | content_percentage | messages_percentage |
+------------+-----+--------------------+---------------------+
| 2019-01-31 | 123 | 1.2                | null                |
+------------+-----+--------------------+---------------------+
| 2019-01-31 | 333 | null               | 2.57                |
+------------+-----+--------------------+---------------------+
I have about 20 apps, so a dynamic solution would be great.
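For reference, a minimal DataFrame reproducing the sample rows above can be built like this (column names are taken from the table; the dtypes are an assumption, with the nulls as NaN):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'yyyy_mm_dd': ['2019-01-31', '2019-01-31', '2019-01-31'],
    'id': [123, 333, 832],
    'app': ['content', 'messages', 'photos'],
    'ex_status': ['impl', 'impl', 'no'],
    'active_status': ['impl', 'impl', 'no'],
    'active_count': [390, 541, np.nan],
    'active_base': [321, 210, 430],
    'ex_count': [344, 788, np.nan],
    'ex_base': [340, 610, 100],
})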
Upvotes: 1
Views: 955
Reputation: 863166
Use numpy.select for the percentages and then use DataFrame.pivot_table:
import numpy as np

# Masks matching the two CASE conditions from the question
m1 = (df.app == 'content') & (df.ex_status == 'impl')
m2 = df.active_status == 'impl'
# Corresponding ratios: s1 is used where m1 holds, otherwise s2 where m2 holds
s1 = df.ex_count / df.ex_base
s2 = df.active_count / df.active_base
df['percentage'] = np.select([m1, m2], [s1, s2], np.nan)
df1 = (df.pivot_table(index=['yyyy_mm_dd', 'id'],
                      columns='app',
                      values='percentage',
                      aggfunc=lambda x: x.sum(min_count=1))
         .add_suffix('_percentage')
         .reset_index())
print (df1)
app  yyyy_mm_dd   id  content_percentage  messages_percentage
0    2019-01-31  123            1.011765                  NaN
1    2019-01-31  333                 NaN             2.576190
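Note that pivot_table already creates one column per distinct app value, so the ~20 apps need no extra code on the pivoting side. If several apps should each use their own numerator/denominator pair for the ratio, the inputs to np.select can be built from a mapping; this is only a sketch under that assumption, with ratio_rules as a hypothetical dict:

# Hypothetical per-app rules: app -> (numerator column, denominator column)
ratio_rules = {
    'content': ('ex_count', 'ex_base'),
    'photos': ('ex_count', 'ex_base'),
}
conds = [(df.app == a) & (df.ex_status == 'impl') for a in ratio_rules]
choices = [df[num] / df[den] for num, den in ratio_rules.values()]
# Rows not covered by a per-app rule fall back to the active_* ratio when active_status is 'impl'
conds.append(df.active_status == 'impl')
choices.append(df.active_count / df.active_base)
df['percentage'] = np.select(conds, choices, np.nan)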
EDIT:
print (m1)
0     True
1    False
2    False
dtype: bool

print (m2)
0     True
1     True
2    False
Name: active_status, dtype: bool

print (s1)
0    1.011765
1    1.291803
2         NaN
dtype: float64
print (s2)
0    1.214953
1    2.576190
2         NaN
dtype: float64
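For completeness, the combined column produced by np.select can be inspected the same way; with the three sample rows above it comes out as:

print (df['percentage'])
0    1.011765
1    2.576190
2         NaN
Name: percentage, dtype: float64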
Upvotes: 1