Hunter Mitchell

Reputation: 399

How to group unioned dataframe to combine same rows

I have just unioned two dataframes in PySpark, and instead of combining the rows with the same dates, it stacked them on top of each other like so:

df1:

+----------+------------+--------------+
|      date| bounceCount|  captureCount|
+----------+------------+--------------+ 
|  20190518|           2|          null|
|  20190521|           1|          null|
|  20190519|           1|          null|
|  20190522|           1|          null|
+----------+------------+--------------+

df2:

+----------+-------------+-------------+
|      date| captureCount|  bounceCount|
+----------+-------------+-------------+ 
|  20190516|         null|            3|
|  20190518|         null|            2|
|  20190519|         null|            1|
|  20190524|         null|            5|
+----------+-------------+-------------+

union:

+----------+------------+--------------+
|      date| bounceCount|  captureCount|
+----------+------------+--------------+ 
|  20190518|           2|          null|
|  20190521|           1|          null|
|  20190519|           1|          null|
|  20190522|           1|          null|
|  20190516|        null|             3|
|  20190518|        null|             2|
|  20190519|        null|             1|
|  20190524|        null|             5|
+----------+------------+--------------+

I would like to group it so that rows with the same date get combined, with the correct bounceCount and captureCount:

+----------+------------+--------------+
|      date| bounceCount|  captureCount|
+----------+------------+--------------+ 
|  20190518|           2|             2|
|  20190521|           1|          null|
|  20190519|           1|             1|
|  20190522|           1|          null|
|  20190516|        null|             3|
|  20190524|        null|             5|
+----------+------------+--------------+

I have tried putting them together in different ways and grouping the dataframe in different ways, but I cannot figure it out. I will also be joining this dataframe with several other columns, so I would like to know the best way to do this. Anyone know a simple way of doing this?

Upvotes: 1

Views: 859

Answers (2)

Shantanu Sharma

Reputation: 4089

Try this:

Join both dataframes with a full (outer) join and use the coalesce function.

from pyspark.sql.functions import coalesce

joining_condition = [df1.date == df2.date]

# A full outer join keeps dates that appear in only one dataframe;
# coalesce picks the date from whichever side is non-null.
# df2's count columns look interchanged (see the note below), so its
# bounceCount is selected as captureCount here.
df1\
    .join(df2, joining_condition, 'full')\
    .select(coalesce(df1.date, df2.date).alias('date')
            ,df1.bounceCount
            ,df2.bounceCount.alias('captureCount'))\
    .show()

#+--------+-----------+------------+
#|    date|bounceCount|captureCount|
#+--------+-----------+------------+
#|20190518|          2|           2|
#|20190519|          1|           1|
#|20190521|          1|        null|
#|20190524|       null|           5|
#|20190522|          1|        null|
#|20190516|       null|           3|
#+--------+-----------+------------+

I think the columns of the df2 dataframe got interchanged. Please check; if that's the case, change the column names in the solution.
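
If the headers in df2 are actually correct as printed, a sketch of the same full join that coalesces each count across both sides might look like this (it assumes both dataframes really do expose bounceCount and captureCount columns):

from pyspark.sql.functions import coalesce

# variant assuming df2's headers are correct as printed: take each
# count from whichever dataframe has a non-null value for that date
df1\
    .join(df2, df1.date == df2.date, 'full')\
    .select(coalesce(df1.date, df2.date).alias('date')
            ,coalesce(df1.bounceCount, df2.bounceCount).alias('bounceCount')
            ,coalesce(df1.captureCount, df2.captureCount).alias('captureCount'))\
    .show()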

Upvotes: 1

Louis Yang

Reputation: 3831

You can achieve this with an outer join.

df = (
    df1.select('date', 'bounceCount')
    .join(
        df2.select('date', 'captureCount'),
        on='date', how='outer'  # joining on the column name merges the two date columns into one
    )
)
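
If you would rather keep the union approach from the question, grouping the unioned dataframe by date and taking the first non-null value per column should give the same result. A minimal sketch, assuming both dataframes are first aligned to the same column order (first with ignorenulls=True is a standard pyspark.sql function):

from pyspark.sql.functions import first

cols = ['date', 'bounceCount', 'captureCount']

# stack the rows, then collapse duplicate dates by taking the
# first non-null count seen for each column
unioned = df1.select(cols).union(df2.select(cols))
result = unioned.groupBy('date').agg(
    first('bounceCount', ignorenulls=True).alias('bounceCount'),
    first('captureCount', ignorenulls=True).alias('captureCount'),
)

The outer join avoids the extra shuffle of a groupBy when each date appears at most once per dataframe, which is why it is usually the simpler choice here.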

Upvotes: 1
