user10411415


Create a new column that indicates whether rows in one PySpark dataframe match a row in another dataframe

I want to write a function that performs a left join in PySpark and adds a new column indicating, row by row, whether a value in one column matches the value in the same column of another dataframe.

For example, we have one PySpark dataframe (d1) that has columns ID and Name and another PySpark dataframe (d2) that has the same columns - ID and Name.

I'm trying to make a function that joins these two tables and creates a new column that shows 'True' or 'False' depending on whether the same ID exists in both dataframes.

So far, I have this

def doValuesMatch(df1, df2):
    left_join = df1.join(df2, on='ID', how='left')
    df1.withColumn('MATCHES?', .....(not sure what to do here))

I'm new to PySpark, can someone please help me? Thanks in advance.

Upvotes: 0

Views: 690

Answers (1)

过过招

Reputation: 4199

It may be something like this.

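from pyspark.sql import SparkSession, functions as F

# Reuse the active session if there is one, otherwise start a new one.
spark = SparkSession.builder.getOrCreate()

data1 = [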
    (1, 'come'),
    (2, 'on'),
    (3, 'baby'),
    (4, 'hurry')
]
data2 = [
    (2, 'on'),
    (3, 'baby'),
    (5, 'no')
]
df1 = spark.createDataFrame(data1, ['id', 'name'])
df2 = spark.createDataFrame(data2, ['id', 'name'])
df2 = df2.withColumnRenamed('name', 'name2')
# After the left join, name2 is null for ids that are missing from df2.
df = df1.join(df2, on='id', how='left').withColumn('MATCHES', F.expr('if(name2 is null, "False", "True")'))
df.show(truncate=False)

Upvotes: 0
