Which Spark-compatible data quality / profiling framework is light enough to process a large (100+ GB) Parquet dataset from S3?

I have 100+ GB of data stored in S3 in Parquet format, and it's essential for me to choose the right tool to perform data quality checks and profiling on it.

I came across three different tools: Great Expectations, Deequ, and cuallee.

Could anyone share a comparison of them based on architecture, scalability, and suitability for working with large datasets stored in S3?

I have tried Deequ with PySpark, and it takes too long to compute the data quality checks even for just 1000+ rows.

Furthermore, could anyone share working and efficient examples for Great Expectations, Deequ, and cuallee? I have referred to this link, which mentions:

"cuallee It is an open-source framework, that supports the Observation API to make testing on billions of records, super-fast, and less resource greedy as pydeequ. Is intuitive, and easy to use"

Which of the above three (or something else) would be the best tool to choose? Please share some insights.

Adding this as well:

I came across a comparison table for PyDeequ vs. cuallee.

I have referred to these documents and tried to implement Deequ, but it takes too long to run or runs forever. I also wonder why most Deequ examples are written in Scala rather than PySpark. Does anyone know?

Code:

import os

# Set SPARK_VERSION before importing pydeequ so it can pick the matching Deequ jar
os.environ['SPARK_VERSION'] = '3.1'  # Update this to match your Spark version

import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult
from pyspark.sql import SparkSession

# Create a Spark session with the Deequ jar on the classpath (per the PyDeequ README);
# without these configs the verification suite cannot find the Deequ classes
spark = SparkSession.builder \
    .appName("PyDeequExample") \
    .config("spark.jars.packages", pydeequ.deequ_maven_coord) \
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord) \
    .getOrCreate()

# Load your data into a Spark DataFrame
# (outside EMR/Glue this may need the s3a:// scheme plus the hadoop-aws package)
df = spark.read.parquet("s3://your-bucket/customer-orders/")

# Define the data quality check
check = Check(spark, CheckLevel.Error, "Data Quality Check")

check_result = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
        check.hasSize(lambda x: x > 1000)  # Expect more than 1000 rows
             .isComplete("order_id")  # Ensure order_id is not null
             .isUnique("order_id")  # Ensure order_id has no duplicates
             .isComplete("customer_id")  # Ensure customer_id is not null
             .hasMin("order_amount", lambda x: x >= 0)  # order_amount should be >= 0
             .hasMax("order_amount", lambda x: x <= 10000)  # order_amount should be <= 10000
    ) \
    .run()

# Get the check results in a human-readable format
check_result_df = VerificationResult.checkResultsAsDataFrame(spark, check_result)
check_result_df.show()
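
For Great Expectations, this is the kind of snippet I was planning to try against the same DataFrame df. It uses the legacy SparkDFDataset wrapper, which I understand has been deprecated in recent releases in favour of the Fluent data source API, so please treat it as a rough sketch rather than a recommended approach:

from great_expectations.dataset import SparkDFDataset

# Wrap the existing Spark DataFrame (legacy GE API; newer versions use Fluent data sources)
gdf = SparkDFDataset(df)

# Register expectations equivalent to the PyDeequ checks above
gdf.expect_column_values_to_not_be_null("order_id")
gdf.expect_column_values_to_be_unique("order_id")
gdf.expect_column_values_to_not_be_null("customer_id")
gdf.expect_column_values_to_be_between("order_amount", min_value=0, max_value=10000)

# Run all registered expectations and print the summary
results = gdf.validate()
print(results)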
