Reputation: 499
I am using PySpark where I am building Temporary Views out of dataframes. As part of it, I would like to know how to compare the counts of two dataframes and if they don't match, throw an error.
My code looks like this:
df1 = spark.sql ("""SELECT Col1, Col2, Col3, Col4, Col5 FROM Table1""")
df1.createOrReplaceTempView("df1")
df2 = spark.sql ("""SELECT Col1, Col2 FROM Table2""")
df2.createOrReplaceTempView("df2")
df_join = spark.sql ("""SELECT Table1.Col1, Table1.Col2 FROM Table1 LEFT OUTER JOIN Table2 ON Table1.Col1 = Table2.Col1""")
df_join .createOrReplaceTempView("df_join")
The counts of df1 and df_join should always match. So, I would like to compare the counts between these two dataframes and throw an error if those doesn't match.
Looking for some help here please...
Upvotes: 1
Views: 1407
Reputation: 26
DataFrame object has count() method: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.count
So it's going to be as simple as:
if df1.count() != df_join.count():
raise ValueError('Dataframes counts doesnt match')
Upvotes: 1
Reputation: 13541
From the join, add the count with the condition.
df_join = spark.sql("""
SELECT
Table1.Col1,
Table1.Col2,
count(Table1.Col1) as CountDF1,
count(IF(Table2.Col1 is not null, 1)) as CountDF2
FROM
Table1
LEFT OUTER JOIN
Table2
ON
Table1.Col1 = Table2.Col1
""")
And compare those two counts.
Or you just can compare it simply as follows:
df1.count() == df2.count()
Upvotes: 1