ritzen101

Reputation: 37

Join two dataframes on multiple conditions pyspark

I have two tables: the first is the testAppointment table and the second is the actualTests table. I want to join the two DataFrames so that the resulting table has a column "NoShows". This column indicates that the person booked an appointment for a date but did not actually show up for the test on that date. Example:

testAppointment Table

personId  testDate
x         2021-02-12
y         2021-03-18
x         2020-11-01
z         2020-09-10
y         2021-01-08
z         2020-12-24

Test Table

personId  ActualtestDate
x         2021-02-12
y         2021-03-18
x         2020-11-01
z         2020-09-10

Since person y (2021-01-08) and person z (2020-12-24) did not show up for their tests on those two dates, there is no entry for them in the Test table. Hence, I want a no-show flag in their respective rows after the two tables are joined.
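For reference, the example data above can be reproduced with something like the following sketch, assuming an active SparkSession named spark (the DataFrame names match those used in the answer below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# appointments that were booked
testappointment = spark.createDataFrame(
    [('x', '2021-02-12'), ('y', '2021-03-18'), ('x', '2020-11-01'),
     ('z', '2020-09-10'), ('y', '2021-01-08'), ('z', '2020-12-24')],
    ['personId', 'testDate']
)

# tests that actually took place
test = spark.createDataFrame(
    [('x', '2021-02-12'), ('y', '2021-03-18'),
     ('x', '2020-11-01'), ('z', '2020-09-10')],
    ['personId', 'ActualtestDate']
)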

Upvotes: 0

Views: 9703

Answers (1)

mck

Reputation: 42332

You can do a left join and create a new column based on whether ActualtestDate is null:

result = testappointment.join(
    test,
    (testappointment.testDate == test.ActualtestDate) & (testappointment.personId == test.personId),
    'left'   # left join keeps every appointment, even if there was no matching test
).select(
    testappointment.personId,
    testappointment.testDate,
    test.ActualtestDate.isNull().alias('noshow')   # null ActualtestDate means the person did not show up
)

result.show()
+--------+----------+------+
|personId|  testDate|noshow|
+--------+----------+------+
|       x|2021-02-12| false|
|       y|2021-03-18| false|
|       x|2020-11-01| false|
|       z|2020-09-10| false|
|       y|2021-01-08|  true|
|       z|2020-12-24|  true|
+--------+----------+------+
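If a text label is preferred over a boolean (the question asks for a "NoShows" column), the flag can be mapped with when/otherwise. This is just an illustrative variation; the 'NoShow'/'Showed' labels are placeholders, not anything specified in the question:

from pyspark.sql import functions as F

labelled = result.withColumn(
    'NoShow',
    F.when(F.col('noshow'), 'NoShow').otherwise('Showed')   # map the boolean flag to a label
).drop('noshow')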

Upvotes: 1
