chendoy

Reputation: 275

Aggregating on 2 columns and turning into columns on another

I have the following dataframe of HTTP requests with columns IP, Timestamp, user_agent and hostname:

IP Timestamp user_agent hostname
1.1.1.1 1656758570 ua1 hostname1
1.1.1.1 1656758580 ua2 hostname2
1.1.1.1 1656758570 ua3 hostname1
1.1.1.1 1656758580 ua1 hostname3
2.2.2.2 1656758580 ua3 hostname2
2.2.2.2 1656758580 ua1 hostname1
2.2.2.2 1656758590 ua3 hostname1
2.2.2.2 1656758590 ua2 hostname3
2.2.2.2 1656758590 ua1 hostname2

I want to group the data by the IP column, then within each group aggregate on the Timestamp column, compute countDistinct for the other columns (user_agent and hostname) that share the same Timestamp value, and finally pivot the aggregated values into new columns. That is, I want the result dataframe to look like this:

IP 1656758570_unique_user_agent 1656758570_unique_hostname 1656758580_unique_user_agent 1656758580_unique_hostname 1656758590_unique_user_agent 1656758590_unique_hostname
1.1.1.1 2 1 2 2 0 0
2.2.2.2 0 0 2 2 3 3

Is there a convenient way to do this instead of multiple withColumn calls?

Upvotes: 0

Views: 64

Answers (2)

wwnde

Reputation: 26676

Use a list comprehension to build the aggregations, then pivot the counts. Note that both IP (the grouping key) and Timestamp (the pivot column) are dropped before building the countDistinct list, so only user_agent and hostname are counted. Code below:

from pyspark.sql import functions as F

(df
    .groupBy('IP')
    .pivot('Timestamp')
    .agg(*[F.countDistinct(c).alias(f'unique_{c}')
           for c in df.drop('IP', 'Timestamp').columns])
    .fillna(0)
).show()

+-------+----------------------------+--------------------------+----------------------------+--------------------------+----------------------------+--------------------------+
|     IP|1656758570_unique_user_agent|1656758570_unique_hostname|1656758580_unique_user_agent|1656758580_unique_hostname|1656758590_unique_user_agent|1656758590_unique_hostname|
+-------+----------------------------+--------------------------+----------------------------+--------------------------+----------------------------+--------------------------+
|2.2.2.2|                           0|                         0|                           2|                         2|                           3|                         3|
|1.1.1.1|                           2|                         1|                           2|                         2|                           0|                         0|
+-------+----------------------------+--------------------------+----------------------------+--------------------------+----------------------------+--------------------------+

Upvotes: 1

ZygD

Reputation: 24386

You can insert a pivot between the groupBy and the agg:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('1.1.1.1', 1656758570, 'ua1', 'hostname1'),
     ('1.1.1.1', 1656758580, 'ua2', 'hostname2'),
     ('1.1.1.1', 1656758570, 'ua3', 'hostname1'),
     ('1.1.1.1', 1656758580, 'ua1', 'hostname3'),
     ('2.2.2.2', 1656758580, 'ua3', 'hostname2'),
     ('2.2.2.2', 1656758580, 'ua1', 'hostname1'),
     ('2.2.2.2', 1656758590, 'ua3', 'hostname1'),
     ('2.2.2.2', 1656758590, 'ua2', 'hostname3'),
     ('2.2.2.2', 1656758590, 'ua1', 'hostname2')],
    ['IP', 'Timestamp', 'user_agent', 'hostname'])

df = (df
    .groupBy('IP')
    .pivot('Timestamp')
    .agg(F.countDistinct('user_agent').alias('unique_user_agent'),
         F.countDistinct('hostname').alias('unique_hostname'))
    .fillna(0)
)

df.show()
# +-------+----------------------------+--------------------------+----------------------------+--------------------------+----------------------------+--------------------------+
# |     IP|1656758570_unique_user_agent|1656758570_unique_hostname|1656758580_unique_user_agent|1656758580_unique_hostname|1656758590_unique_user_agent|1656758590_unique_hostname|
# +-------+----------------------------+--------------------------+----------------------------+--------------------------+----------------------------+--------------------------+
# |2.2.2.2|                           0|                         0|                           2|                         2|                           3|                         3|
# |1.1.1.1|                           2|                         1|                           2|                         2|                           0|                         0|
# +-------+----------------------------+--------------------------+----------------------------+--------------------------+----------------------------+--------------------------+

Upvotes: 1
