Pyspark question making count result into a dataframe

I have a pyspark function that looks like this. \

spark.sql("select count(*) from student_table where student_id is NULL") \
spark.sql("select count(*) from student_table where student_scores is NULL") \
spark.sql("select count(*) from student_table where student_health is NULL")

I get a result that looks like \

+-------+|count(1)|\n-------+|    0|+-------+\n|count(1)|\n-------+|    100|+-------+\n|count(1)|\n-------+|  24145|

What I want to do is to make the result into a dataframe for each column by using pandas or pyspark function. The result should have each null value result for each column.
For example,

Thanks in advance if someone can help me out.

Upvotes: 0

Answers (2)

blackbishop

Reputation: 32720

You could use union between the 3 queries but actually you can get all null counts for each column using one query:

spark.sql("""
    SELECT  SUM(INT(student_id IS NULL))     AS student_id_nb_null,
            SUM(INT(student_scores IS NULL)) AS student_scores_nb_null,
            SUM(INT(student_health IS NULL)) AS student_health_nb_null,
    FROM    student_table 
""").show()

#+------------------+----------------------+----------------------+
#|student_id_nb_null|student_scores_nb_null|student_health_nb_null|
#+------------------+----------------------+----------------------+
#|                 0|                   100|                 24145|
#+------------------+----------------------+----------------------+

Or by using DataFrame API with:

import pyspark.sql.functions as F    

df.agg(
    F.sum(F.col("student_id").isNull().cast("int")).alias("student_id_nb_null"),
    F.sum(F.col("student_scores").isNull().cast("int")).alias("student_scores_nb_null"),
    F.sum(F.col("student_health").isNull().cast("int")).alias("student_health_nb_null")
)

Upvotes: 1

notNull

Reputation: 31540

Use union all and add all your queries in one spark.sql.

Example:

spark.sql("""select "student_id" `column_name`,count(*) `null_result` from tmp where student_id is null \
union all \
select "student_scores" `column_name`,count(*) `null_result` from tmp where student_scores is null \
union all \
select "student_health" `column_name`,count(*) `null_result` from tmp where student_health is null""").\
show()

Upvotes: 0

Pyspark question making count result into a dataframe

Answers (2)

Related Questions