user23355692

Reputation: 9

Compare two DataFrames using PySpark

I'm working on a project where I need to compare two DataFrames. I initially used the compare function offered by pandas, and its output gives exactly the result I want. However, I now need to do the same thing with big data. How can I compare two DataFrames in the same way using PySpark?

pandas.DataFrame.compare https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html
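
For example, here's a minimal sketch of what I'm doing today with pandas (the column names are just illustrative):

import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
df2 = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "x", "c"]})

# Prints only the cells that differ, side by side as "self" and "other"
print(df1.compare(df2))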

I reviewed the Apache Spark documentation, but wasn't able to find an equivalent in the Pandas API on Spark.

Pandas API on Spark https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/frame.html

Upvotes: 0

Views: 579

Answers (3)

tukai

Reputation: 21

In Spark 3.5, two equality test functions were introduced for PySpark DataFrames: assertDataFrameEqual and assertSchemaEqual. An overview can be found at this link: https://www.databricks.com/blog/simplify-pyspark-testing-dataframe-equality-functions
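
A minimal sketch of how assertDataFrameEqual is used (the DataFrame contents here are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.testing import assertDataFrameEqual

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df2 = spark.createDataFrame([(1, "a"), (2, "c")], ["id", "value"])

# Raises an AssertionError listing the rows that differ;
# passes silently when the DataFrames are equal
assertDataFrameEqual(df1, df2)

Note that these functions are built for tests: they assert equality and report differences on failure, rather than returning a DataFrame of differences the way pandas' compare does.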

Upvotes: 0

Rinat Veliakhmedov

Reputation: 1039

There's nothing in Spark that does that out of the box. You can do a full outer join of the two DataFrames on a common key, then inspect the rows that contain mismatches.
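
A minimal sketch of that approach, assuming a common key column id and a single data column value (both names are illustrative):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])
df2 = spark.createDataFrame([(1, "a"), (2, "x"), (4, "d")], ["id", "value"])

# Full outer join on the key keeps rows that exist on either side
joined = df1.alias("l").join(df2.alias("r"), on="id", how="full")

# Keep rows where the two sides disagree (eqNullSafe treats NULL == NULL
# as true, so rows present on only one side also surface as mismatches)
mismatches = joined.where(~F.col("l.value").eqNullSafe(F.col("r.value"))).select(
    "id",
    F.col("l.value").alias("left_value"),
    F.col("r.value").alias("right_value"),
)

mismatches.show()

For a real schema you would repeat the null-safe comparison for every non-key column, e.g. by building the predicate in a loop over df1.columns.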

Upvotes: 0

Talha Tayyab

Reputation: 27725

You can check out the datacompy package.

It was initially built for pandas and later extended to Spark DataFrames as well.

https://capitalone.github.io/datacompy/

pip install datacompy

The main use case for datacompy is when you need to interpret the differences between two DataFrames.
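
A minimal sketch with the pandas API (the Spark comparison classes in datacompy follow the same join_columns / report pattern; all names below are illustrative):

import pandas as pd
import datacompy

df1 = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
df2 = pd.DataFrame({"id": [1, 2, 4], "value": ["a", "x", "d"]})

compare = datacompy.Compare(
    df1,
    df2,
    join_columns="id",    # key(s) to align rows on
    df1_name="original",  # labels used in the report
    df2_name="new",
)

# Human-readable summary: row/column matches, mismatched values,
# and rows present in only one DataFrame
print(compare.report())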


There is also spark-diff, from the spark-extension package:

https://github.com/G-Research/spark-extension/blob/master/DIFF.md

You can check the code given in the examples there for reference.
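
A minimal sketch based on that DIFF.md (it assumes the pyspark-extension package is installed, with a version matching your Spark version):

from pyspark.sql import SparkSession
from gresearch.spark.diff import *  # patches a diff() method onto DataFrame

spark = SparkSession.builder.getOrCreate()

left = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])
right = spark.createDataFrame([(1, "a"), (2, "x"), (4, "d")], ["id", "value"])

# One row per key with a diff column: N (no change), C (changed),
# I (inserted on the right), D (deleted from the right)
left.diff(right, "id").show()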

Upvotes: 0
