Reputation: 491
I have a table named monthly_agg which contains monthly aggregated data.
+------------+-----+----------+-----------+---------------+--------------+-------------+----------+---------+
| yyyy_mm_dd | id  | app      | ex_status | active_status | active_count | active_base | ex_count | ex_base |
+------------+-----+----------+-----------+---------------+--------------+-------------+----------+---------+
| 2019-01-31 | 123 | content  | impl      | impl          | 390          | 321         | 344      | 340     |
+------------+-----+----------+-----------+---------------+--------------+-------------+----------+---------+
| 2019-01-31 | 333 | messages | impl      | impl          | 541          | 210         | 788      | 610     |
+------------+-----+----------+-----------+---------------+--------------+-------------+----------+---------+
| 2019-01-31 | 832 | photos   | no        | no            | null         | 430         | null     | 100     |
+------------+-----+----------+-----------+---------------+--------------+-------------+----------+---------+
I want to make each app be a column. Each app column should contain a percentage, which is calculated as follows:
df = spark.sql("""
    SELECT
        yyyy_mm_dd,
        id,
        app,
        SUM(CASE
                WHEN (app = 'content' AND ex_status = 'impl') THEN ex_count / ex_base
                WHEN (active_status = 'impl') THEN active_count / active_base
            END) AS percentage
    FROM
        monthly_agg
    GROUP BY
        yyyy_mm_dd, id, app
""")
I need each app value to become a column, with the value of that column being the result of the above calculation. How could I pivot the table this way using Pandas instead of HQL? Ideally my output df would look like this:
+------------+-----+--------------------+---------------------+
| yyyy_mm_dd | id  | content_percentage | messages_percentage |
+------------+-----+--------------------+---------------------+
| 2019-01-31 | 123 | 1.2                | null                |
+------------+-----+--------------------+---------------------+
| 2019-01-31 | 333 | null               | 2.57                |
+------------+-----+--------------------+---------------------+
I have about 20 apps, so a dynamic solution would be great.
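For reference, a minimal DataFrame reproducing the sample rows above can be built like this (column names are taken from the table; the dtypes are an assumption, with the nulls as NaN):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'yyyy_mm_dd': ['2019-01-31', '2019-01-31', '2019-01-31'],
    'id': [123, 333, 832],
    'app': ['content', 'messages', 'photos'],
    'ex_status': ['impl', 'impl', 'no'],
    'active_status': ['impl', 'impl', 'no'],
    'active_count': [390, 541, np.nan],
    'active_base': [321, 210, 430],
    'ex_count': [344, 788, np.nan],
    'ex_base': [340, 610, 100],
})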
Upvotes: 1
Views: 955
Reputation: 863166
Use numpy.select for the percentages and then use DataFrame.pivot_table:
import numpy as np

# Masks matching the two CASE conditions from the question
m1 = (df.app == 'content') & (df.ex_status == 'impl')
m2 = df.active_status == 'impl'
# Corresponding ratios: s1 is used where m1 holds, otherwise s2 where m2 holds
s1 = df.ex_count / df.ex_base
s2 = df.active_count / df.active_base
df['percentage'] = np.select([m1, m2], [s1, s2], np.nan)
df1 = (df.pivot_table(index=['yyyy_mm_dd', 'id'],
                      columns='app',
                      values='percentage',
                      aggfunc=lambda x: x.sum(min_count=1))
         .add_suffix('_percentage')
         .reset_index())
print (df1)
app  yyyy_mm_dd   id  content_percentage  messages_percentage
0    2019-01-31  123            1.011765                  NaN
1    2019-01-31  333                 NaN             2.576190
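Note that pivot_table already creates one column per distinct app value, so the ~20 apps need no extra code on the pivoting side. If several apps should each use their own numerator/denominator pair for the ratio, the inputs to np.select can be built from a mapping; this is only a sketch under that assumption, with ratio_rules as a hypothetical dict:

# Hypothetical per-app rules: app -> (numerator column, denominator column)
ratio_rules = {
    'content': ('ex_count', 'ex_base'),
    'photos': ('ex_count', 'ex_base'),
}
conds = [(df.app == a) & (df.ex_status == 'impl') for a in ratio_rules]
choices = [df[num] / df[den] for num, den in ratio_rules.values()]
# Rows not covered by a per-app rule fall back to the active_* ratio when active_status is 'impl'
conds.append(df.active_status == 'impl')
choices.append(df.active_count / df.active_base)
df['percentage'] = np.select(conds, choices, np.nan)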
EDIT:
print (m1)
0     True
1    False
2    False
dtype: bool

print (m2)
0     True
1     True
2    False
Name: active_status, dtype: bool

print (s1)
0    1.011765
1    1.291803
2         NaN
dtype: float64
print (s2)
0    1.214953
1    2.576190
2         NaN
dtype: float64
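For completeness, the combined column produced by np.select can be inspected the same way; with the three sample rows above it comes out as:

print (df['percentage'])
0    1.011765
1    2.576190
2         NaN
Name: percentage, dtype: float64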
Upvotes: 1