Reputation: 399
I have just unioned two DataFrames in PySpark, and instead of combining the rows with the same dates, it stacked them on top of each other like so:
+----------+------------+--------------+
| date| bounceCount| captureCount|
+----------+------------+--------------+
| 20190518| 2| null|
| 20190521| 1| null|
| 20190519| 1| null|
| 20190522| 1| null|
+----------+------------+--------------+
+----------+-------------+-------------+
| date| captureCount| bounceCount|
+----------+-------------+-------------+
| 20190516| null| 3|
| 20190518| null| 2|
| 20190519| null| 1|
| 20190524| null| 5|
+----------+-------------+-------------+
+----------+------------+--------------+
| date| bounceCount| captureCount|
+----------+------------+--------------+
| 20190518| 2| null|
| 20190521| 1| null|
| 20190519| 1| null|
| 20190522| 1| null|
| 20190516| null| 3|
| 20190518| null| 2|
| 20190519| null| 1|
| 20190524| null| 5|
+----------+------------+--------------+
I would like to combine them so that rows with the same date are merged, each with the correct bounceCount and captureCount:
+----------+------------+--------------+
| date| bounceCount| captureCount|
+----------+------------+--------------+
| 20190518| 2| 2|
| 20190521| 1| null|
| 20190519| 1| 1|
| 20190522| 1| null|
| 20190516| null| 3|
| 20190524| null| 5|
+----------+------------+--------------+
I have tried putting them together in different ways and grouping the dataframe in different ways, but I cannot figure it out. I will also be attaching this dataframe with several other columns, so I would like to know the best way to do this. Does anyone know a simple way of doing this?
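For reference, this is roughly how the two input frames can be reproduced (assuming string dates; the all-null columns from the printed output are omitted, and the capture values are placed under captureCount as the expected output implies):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [('20190518', 2), ('20190521', 1), ('20190519', 1), ('20190522', 1)],
    ['date', 'bounceCount'],
)
df2 = spark.createDataFrame(
    [('20190516', 3), ('20190518', 2), ('20190519', 1), ('20190524', 5)],
    ['date', 'captureCount'],
)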
Upvotes: 1
Views: 859
Reputation: 4089
Try this: full-join both dataframes and use the coalesce function.
from pyspark.sql.functions import coalesce

joining_condition = [df1.date == df2.date]

# df2's capture values appear under bounceCount (see note below), hence the alias
df1.join(df2, joining_condition, 'full') \
   .select(coalesce(df1.date, df2.date).alias('date'),
           df1.bounceCount,
           df2.bounceCount.alias('captureCount')) \
   .show()
#+--------+-----------+------------+
#| date|bounceCount|captureCount|
#+--------+-----------+------------+
#|20190518| 2| 2|
#|20190519| 1| 1|
#|20190521| 1| null|
#|20190524| null| 5|
#|20190522| 1| null|
#|20190516| null| 3|
#+--------+-----------+------------+
I think the columns of the df2 dataframe got interchanged (the capture values appear under bounceCount), which is why the code above aliases df2.bounceCount as captureCount. Please check, and change the column names in the solution to match your actual schema.
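Since you mention attaching this dataframe with several other columns later, the same full join + coalesce pattern can be chained. A rough sketch, assuming a hypothetical third frame df3 carrying a clickCount column (and that df2's capture values really sit in a column named captureCount):

from pyspark.sql.functions import coalesce

# combine df1 and df2 first, keeping a single coalesced date column
combined = (
    df1.join(df2, df1.date == df2.date, 'full')
       .select(coalesce(df1.date, df2.date).alias('date'),
               df1.bounceCount,
               df2.captureCount)
)

# attach the next per-date frame the same way (df3 / clickCount are hypothetical)
result = (
    combined.join(df3, combined.date == df3.date, 'full')
            .select(coalesce(combined.date, df3.date).alias('date'),
                    combined.bounceCount,
                    combined.captureCount,
                    df3.clickCount)
)
result.show()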
Upvotes: 1
Reputation: 3831
You can achieve this with an outer join:
df = (
    df1.select('date', 'bounceCount')
    .join(
        df2.select('date', 'captureCount'),
        on='date', how='outer'
    )
)
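If you prefer zeroes over nulls for the missing counts, you can optionally fill them afterwards, for example:

df.orderBy('date').show()

# optional: replace nulls with 0 in the count columns
df.fillna(0, subset=['bounceCount', 'captureCount']).show()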
Upvotes: 1