Reputation: 3987
I have 100GB+ of data stored in S3, in Parquet format. It's essential to choose the right tool to perform data quality checks and profiling.
I came across three different tools: Great Expectations, Deequ, and Cuallee.
Could anyone share a comparison based on architecture, scalability, and suitability for working with large datasets stored in S3?
I have tried Deequ with PySpark, and it takes too long to compute the data quality checks, even for just 1000+ rows.
Furthermore, could anyone share working and efficient examples for Great Expectations, Deequ, and Cuallee? I have referred to this link, which mentions:
"cuallee It is an open-source framework, that supports the Observation API to make testing on billions of records, super-fast, and less resource greedy as pydeequ. Is intuitive, and easy to use"
Which of these three tools (or something else entirely) would be the best choice? Please share some insights.
Adding this as well:
I came across this comparison: PyDeequ vs. cuallee.
I have referred to these documents and tried to implement Deequ, but it either takes too long to run or runs forever. I also wonder why most Deequ examples are written in Scala rather than PySpark. Does anyone know?
Code:
import os

# PyDeequ reads SPARK_VERSION when it is imported, so set it first
os.environ["SPARK_VERSION"] = "3.1"  # Update this to match your Spark version

import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult
from pyspark.sql import SparkSession

# Create a Spark session with the Deequ jar on the classpath; without the
# spark.jars.packages config the JVM side of PyDeequ cannot find Deequ
spark = (SparkSession.builder
         .appName("PyDeequExample")
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

# Load the data into a Spark DataFrame
df = spark.read.parquet("s3://your-bucket/customer-orders/")

# Define the data quality check
check = Check(spark, CheckLevel.Error, "Data Quality Check")

# Note: PyDeequ's isContainedIn expects a list of allowed values, not a
# numeric range, so the range check is expressed with hasMin/hasMax instead
check_result = (VerificationSuite(spark)
                .onData(df)
                .addCheck(
                    check.hasSize(lambda x: x > 1000)              # Expect more than 1000 rows
                    .isComplete("order_id")                        # Ensure order_id is not null
                    .isUnique("order_id")                          # Ensure order_id has no duplicates
                    .isComplete("customer_id")                     # Ensure customer_id is not null
                    .hasMin("order_amount", lambda x: x >= 0)      # order_amount lower bound
                    .hasMax("order_amount", lambda x: x <= 10000)  # order_amount upper bound
                )
                .run())

# Get the check results in a human-readable format
check_result_df = VerificationResult.checkResultsAsDataFrame(spark, check_result)
check_result_df.show()
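For Great Expectations, this is the sketch I was planning to try. It uses the legacy SparkDFDataset API, which I believe has been deprecated in recent GE releases, so treat it as an untested starting point rather than a verified example:
from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset

spark = SparkSession.builder.appName("GEExample").getOrCreate()
df = spark.read.parquet("s3://your-bucket/customer-orders/")

# Wrap the Spark DataFrame so expectations are computed by Spark
gdf = SparkDFDataset(df)

gdf.expect_column_values_to_not_be_null("order_id")
gdf.expect_column_values_to_be_unique("order_id")
gdf.expect_column_values_to_not_be_null("customer_id")
gdf.expect_column_values_to_be_between("order_amount", min_value=0, max_value=10000)

# Evaluate all registered expectations and collect the results
results = gdf.validate()
print(results)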
Upvotes: 0
Views: 79