Reputation: 9
I'm working on a project where I need to compare two dataframes. I initially used the compare method offered by pandas, and its output gives exactly the result I want. However, I now need to accomplish the same task with big data. How can I use PySpark to compare two dataframes in the same way?
pandas.DataFrame.compare https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html
I reviewed the Apache Spark documentation, but wasn't able to find a Pandas API on Spark that's relevant to the above goal.
Pandas API on Spark https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/frame.html
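For context, here is a minimal example of the pandas behaviour I'm after (the column names and values are just illustrative):

    import pandas as pd

    df1 = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
    df2 = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "x", "c"]})

    # Shows only the cells that differ, split into "self"/"other" sub-columns
    print(df1.compare(df2))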
Upvotes: 0
Views: 579
Reputation: 21
Spark 3.5 introduced two equality test functions for PySpark DataFrames: assertDataFrameEqual and assertSchemaEqual. An overview can be found at this link: https://www.databricks.com/blog/simplify-pyspark-testing-dataframe-equality-functions
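A minimal sketch of how the two functions are used, assuming Spark 3.5+ (the sample data is illustrative):

    from pyspark.sql import SparkSession
    from pyspark.testing import assertDataFrameEqual, assertSchemaEqual

    spark = SparkSession.builder.getOrCreate()

    df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df2 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    # Both raise an assertion error with a readable diff if the inputs differ
    assertSchemaEqual(df1.schema, df2.schema)
    assertDataFrameEqual(df1, df2)

Note that these are aimed at testing: they raise on a mismatch rather than returning a dataframe of differences the way pandas.DataFrame.compare does.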
Upvotes: 0
Reputation: 1039
There's nothing in Spark that does that out of the box. You can do a full outer join of the two dataframes on a common key, then keep the rows that contain mismatches. For example:
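A minimal sketch of that approach, assuming both dataframes share an id key and a single value column (the names are illustrative):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    df1 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])
    df2 = spark.createDataFrame([(1, "a"), (2, "x"), (4, "d")], ["id", "value"])

    joined = (
        df1.withColumnRenamed("value", "self")
        .join(df2.withColumnRenamed("value", "other"), on="id", how="full_outer")
    )

    # Keep only rows where the two sides disagree; <=> is Spark's
    # null-safe equality, so rows missing on one side also show up
    mismatches = joined.filter(~F.expr("self <=> other"))
    mismatches.show()

With several value columns you would repeat the rename-and-compare step per column, or build the filter expression programmatically from df1.columns.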
Upvotes: 0
Reputation: 27725
You can check the datacompy package. It was initially built for pandas and was later extended to Spark dataframes as well.
https://capitalone.github.io/datacompy/
pip install datacompy
The main use case for datacompy is when you need a human-readable interpretation of the difference between two dataframes. For example:
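A minimal sketch using the pandas interface (the Spark classes have been renamed across datacompy versions, so check the docs linked above for the one matching your install):

    import pandas as pd
    import datacompy

    df1 = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
    df2 = pd.DataFrame({"id": [1, 2], "value": ["a", "x"]})

    compare = datacompy.Compare(df1, df2, join_columns="id")
    # Prints a human-readable summary of matching and mismatching
    # rows and columns
    print(compare.report())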
There is one more option called spark-diff:
https://github.com/G-Research/spark-extension/blob/master/DIFF.md
You can check the code given in the examples there for reference.
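A minimal sketch following the Python API shown in DIFF.md, assuming the pyspark-extension package is installed and the matching spark-extension jar is on the classpath (e.g. via spark.jars.packages):

    from pyspark.sql import SparkSession
    from gresearch.spark.diff import *  # adds a diff() method to DataFrame

    spark = SparkSession.builder.getOrCreate()

    df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df2 = spark.createDataFrame([(1, "a"), (2, "x")], ["id", "value"])

    # One row per id with a "diff" column flagging the change:
    # N (no change), C (changed), I (inserted), D (deleted)
    df1.diff(df2, "id").show()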
Upvotes: 0