Julaayi
Julaayi

Reputation: 499

How to compare the counts of two dataframes in PySpark?

I am using PySpark where I am building Temporary Views out of dataframes. As part of it, I would like to know how to compare the counts of two dataframes and if they don't match, throw an error.

My code looks like this:

df1 = spark.sql ("""SELECT Col1, Col2, Col3, Col4, Col5 FROM Table1""")
df1.createOrReplaceTempView("df1")

df2 = spark.sql ("""SELECT Col1, Col2 FROM Table2""")
df2.createOrReplaceTempView("df2")

df_join = spark.sql ("""SELECT Table1.Col1, Table1.Col2 FROM Table1 LEFT OUTER JOIN Table2 ON Table1.Col1 = Table2.Col1""")
df_join .createOrReplaceTempView("df_join")

The counts of df1 and df_join should always match. So, I would like to compare the counts between these two dataframes and throw an error if those doesn't match.

Looking for some help here please...

Upvotes: 1

Views: 1407

Answers (2)

care1e55
care1e55

Reputation: 26

DataFrame object has count() method: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.count

So it's going to be as simple as:

if df1.count() != df_join.count():
    raise ValueError('Dataframes counts doesnt match')

Upvotes: 1

Lamanus
Lamanus

Reputation: 13541

From the join, add the count with the condition.

df_join = spark.sql("""
    SELECT 
        Table1.Col1,
        Table1.Col2,
        count(Table1.Col1) as CountDF1,
        count(IF(Table2.Col1 is not null, 1)) as CountDF2
    FROM 
        Table1 
    LEFT OUTER JOIN 
        Table2 
    ON 
        Table1.Col1 = Table2.Col1
""")

And compare those two counts.

Or you just can compare it simply as follows:

df1.count() == df2.count()

Upvotes: 1

Related Questions