I want to write a function that performs a left join in PySpark and adds a new column indicating, row by row, whether a value in one dataframe's column has a match in the same column of another dataframe.
For example, we have one PySpark dataframe (d1) with columns ID and Name, and another PySpark dataframe (d2) with the same columns, ID and Name.
I'm trying to write a function that joins these two tables and creates a new column showing 'True' or 'False' depending on whether the same ID exists in both dataframes.
So far, I have this:
def doValuesMatch(df1, df2):
    left_join = df1.join(df2, on='ID', how='left')
    df1.withColumn('MATCHES?', .....(not sure what to do here))
I'm new to PySpark, can someone please help me? Thanks in advance.
Upvotes: 0
Views: 690
Reputation: 4199
It may be something like this.
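from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data1 = [
    (1, 'come'),
    (2, 'on'),
    (3, 'baby'),
    (4, 'hurry'),
]
data2 = [
    (2, 'on'),
    (3, 'baby'),
    (5, 'no'),
]
df1 = spark.createDataFrame(data1, ['id', 'name'])
df2 = spark.createDataFrame(data2, ['id', 'name'])
# Rename df2's name column so it does not clash with df1's after the join.
df2 = df2.withColumnRenamed('name', 'name2')
# After the left join, name2 is null for every df1 row whose id is missing from df2.
df = df1.join(df2, on='id', how='left').withColumn('MATCHES', F.expr("if(name2 is null, 'False', 'True')"))
df.show(truncate=False)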
Upvotes: 0