Reputation: 399
I have just unioned two DataFrames in PySpark, and instead of combining the rows with the same dates, it stacked them on top of each other like so:
+----------+------------+--------------+
| date| bounceCount| captureCount|
+----------+------------+--------------+
| 20190518| 2| null|
| 20190521| 1| null|
| 20190519| 1| null|
| 20190522| 1| null|
+----------+------------+--------------+
+----------+-------------+-------------+
| date| captureCount| bounceCount|
+----------+-------------+-------------+
| 20190516| null| 3|
| 20190518| null| 2|
| 20190519| null| 1|
| 20190524| null| 5|
+----------+-------------+-------------+
+----------+------------+--------------+
| date| bounceCount| captureCount|
+----------+------------+--------------+
| 20190518| 2| null|
| 20190521| 1| null|
| 20190519| 1| null|
| 20190522| 1| null|
| 20190516| null| 3|
| 20190518| null| 2|
| 20190519| null| 1|
| 20190524| null| 5|
+----------+------------+--------------+
I would like to combine them so that rows with the same date are merged, each with the correct bounceCount and captureCount:
+----------+------------+--------------+
| date| bounceCount| captureCount|
+----------+------------+--------------+
| 20190518| 2| 2|
| 20190521| 1| null|
| 20190519| 1| 1|
| 20190522| 1| null|
| 20190516| null| 3|
| 20190524| null| 5|
+----------+------------+--------------+
I have tried putting them together in different ways and grouping the dataframe in different ways, but I cannot figure it out. I will also be attaching this dataframe with several other columns, so I would like to know the best way to do this. Does anyone know a simple way of doing this?
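For reference, this is roughly how the two input frames can be reproduced (assuming string dates; the all-null columns from the printed output are omitted, and the capture values are placed under captureCount as the expected output implies):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [('20190518', 2), ('20190521', 1), ('20190519', 1), ('20190522', 1)],
    ['date', 'bounceCount'],
)
df2 = spark.createDataFrame(
    [('20190516', 3), ('20190518', 2), ('20190519', 1), ('20190524', 5)],
    ['date', 'captureCount'],
)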
Upvotes: 1
Views: 859
Reputation: 4089
Try this: full-join both dataframes and use the coalesce function.
from pyspark.sql.functions import coalesce

joining_condition = [df1.date == df2.date]

# df2's capture values appear under bounceCount (see note below), hence the alias
df1.join(df2, joining_condition, 'full') \
   .select(coalesce(df1.date, df2.date).alias('date'),
           df1.bounceCount,
           df2.bounceCount.alias('captureCount')) \
   .show()
#+--------+-----------+------------+
#| date|bounceCount|captureCount|
#+--------+-----------+------------+
#|20190518| 2| 2|
#|20190519| 1| 1|
#|20190521| 1| null|
#|20190524| null| 5|
#|20190522| 1| null|
#|20190516| null| 3|
#+--------+-----------+------------+
I think the columns of the df2 dataframe got interchanged (the capture values appear under bounceCount), which is why the code above aliases df2.bounceCount as captureCount. Please check, and change the column names in the solution to match your actual schema.
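Since you mention attaching this dataframe with several other columns later, the same full join + coalesce pattern can be chained. A rough sketch, assuming a hypothetical third frame df3 carrying a clickCount column (and that df2's capture values really sit in a column named captureCount):

from pyspark.sql.functions import coalesce

# combine df1 and df2 first, keeping a single coalesced date column
combined = (
    df1.join(df2, df1.date == df2.date, 'full')
       .select(coalesce(df1.date, df2.date).alias('date'),
               df1.bounceCount,
               df2.captureCount)
)

# attach the next per-date frame the same way (df3 / clickCount are hypothetical)
result = (
    combined.join(df3, combined.date == df3.date, 'full')
            .select(coalesce(combined.date, df3.date).alias('date'),
                    combined.bounceCount,
                    combined.captureCount,
                    df3.clickCount)
)
result.show()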
Upvotes: 1
Reputation: 3831
You can achieve this with an outer join:
df = (
    df1.select('date', 'bounceCount')
    .join(
        df2.select('date', 'captureCount'),
        on='date', how='outer'
    )
)
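If you prefer zeroes over nulls for the missing counts, you can optionally fill them afterwards, for example:

df.orderBy('date').show()

# optional: replace nulls with 0 in the count columns
df.fillna(0, subset=['bounceCount', 'captureCount']).show()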
Upvotes: 1