Reputation: 1223
I am using the code below to compare two columns in a DataFrame. I don't want to do it in pandas. Can someone help me do the comparison with Spark DataFrames?
df1 = context.spark.read.option("header", True).csv("./test/input/test/Book1.csv")
df1=df1.withColumn("Curated", dataclean.clean_email(col("email")))
df1.show()
assert_array_almost_equal(df1['expected'], df1['Curated'],verbose=True)
Upvotes: 1
Views: 975
Reputation: 7326
An efficient approach is to look for the first difference and stop as soon as one is found. One way to do that is with a left-anti join, which keeps only the left-hand rows that have no match under the join condition:
assert df1.join(df1, (df1['expected'] == df1['Curated']), "leftanti").first() is None
Upvotes: 1