Reputation: 9
I'm working on a project where I need to compare two dataframes. I initially used the compare method offered by pandas, and its output gives exactly the result I want. However, I now need to accomplish the same task with big data. How can I use PySpark to compare two dataframes in the same way?
pandas.DataFrame.compare https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html
I reviewed the Apache Spark documentation, but wasn't able to find a Pandas API on Spark that's relevant to the above goal.
Pandas API on Spark https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/frame.html
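For context, here is a minimal example of the pandas behaviour I'm after (the column names and values are just illustrative):

    import pandas as pd

    df1 = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
    df2 = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "x", "c"]})

    # Shows only the cells that differ, split into "self"/"other" sub-columns
    print(df1.compare(df2))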
Upvotes: 0
Views: 579
Reputation: 21
Spark 3.5 introduced two equality test functions for PySpark DataFrames: assertDataFrameEqual and assertSchemaEqual. An overview can be found at this link: https://www.databricks.com/blog/simplify-pyspark-testing-dataframe-equality-functions
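A minimal sketch of how the two functions are used, assuming Spark 3.5+ (the sample data is illustrative):

    from pyspark.sql import SparkSession
    from pyspark.testing import assertDataFrameEqual, assertSchemaEqual

    spark = SparkSession.builder.getOrCreate()

    df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df2 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    # Both raise an assertion error with a readable diff if the inputs differ
    assertSchemaEqual(df1.schema, df2.schema)
    assertDataFrameEqual(df1, df2)

Note that these are aimed at testing: they raise on a mismatch rather than returning a dataframe of differences the way pandas.DataFrame.compare does.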
Upvotes: 0
Reputation: 1039
There's nothing in Spark that does that out of the box. You can do a full outer join of the two dataframes on a common key, then keep the rows that contain mismatches. For example:
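A minimal sketch of that approach, assuming both dataframes share an id key and a single value column (the names are illustrative):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    df1 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])
    df2 = spark.createDataFrame([(1, "a"), (2, "x"), (4, "d")], ["id", "value"])

    joined = (
        df1.withColumnRenamed("value", "self")
        .join(df2.withColumnRenamed("value", "other"), on="id", how="full_outer")
    )

    # Keep only rows where the two sides disagree; <=> is Spark's
    # null-safe equality, so rows missing on one side also show up
    mismatches = joined.filter(~F.expr("self <=> other"))
    mismatches.show()

With several value columns you would repeat the rename-and-compare step per column, or build the filter expression programmatically from df1.columns.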
Upvotes: 0
Reputation: 27725
You can check the datacompy package. It was initially built for pandas and was later extended to Spark dataframes as well.
https://capitalone.github.io/datacompy/
pip install datacompy
The main use case for datacompy is when you need a human-readable interpretation of the difference between two dataframes. For example:
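A minimal sketch using the pandas interface (the Spark classes have been renamed across datacompy versions, so check the docs linked above for the one matching your install):

    import pandas as pd
    import datacompy

    df1 = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
    df2 = pd.DataFrame({"id": [1, 2], "value": ["a", "x"]})

    compare = datacompy.Compare(df1, df2, join_columns="id")
    # Prints a human-readable summary of matching and mismatching
    # rows and columns
    print(compare.report())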
There is one more option called spark-diff:
https://github.com/G-Research/spark-extension/blob/master/DIFF.md
You can check the code given in the examples there for reference.
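A minimal sketch following the Python API shown in DIFF.md, assuming the pyspark-extension package is installed and the matching spark-extension jar is on the classpath (e.g. via spark.jars.packages):

    from pyspark.sql import SparkSession
    from gresearch.spark.diff import *  # adds a diff() method to DataFrame

    spark = SparkSession.builder.getOrCreate()

    df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df2 = spark.createDataFrame([(1, "a"), (2, "x")], ["id", "value"])

    # One row per id with a "diff" column flagging the change:
    # N (no change), C (changed), I (inserted), D (deleted)
    df1.diff(df2, "id").show()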
Upvotes: 0