Sean Nguyen

Reputation: 13118

Why does PySpark SQL not count correctly with a group by clause?

I load a parquet file into a SQL context like this:

from pyspark.sql import SQLContext

sqlCtx = SQLContext(sc)
rdd_file = sqlCtx.read.parquet("hdfs:///my_file.parquet")
rdd_file.registerTempTable("type_table")

Then I run this simple query:

sqlCtx.sql('SELECT count(name), name from type_table group by name order by count(name)').show()

The result:

+----------------+----------+
|count(name)     |name      |
+----------------+----------+
|               0|      null|
|          226307|         x|
+----------------+----------+

However, if I use groupBy on the DataFrame instead, I get a different result:

sqlCtx.sql("SELECT name FROM type_table").groupBy("name").count().show()

+----------+------+
| name     | count|
+----------+------+
|         x|226307|
|      null|586822|
+----------+------+

The count of x is the same for the two methods, but the null counts are quite different. It seems like the SQL statement doesn't count nulls correctly with group by. Can you point out what I did wrong?

Thanks,

Upvotes: 1

Views: 1182

Answers (1)

Arunakiran Nulu

Reputation: 2089

count(name) excludes null values; if you use count(*), it will count the rows with null names as well.

Try the query below:

sqlCtx.sql('SELECT count(*), name from type_table group by name order by count(*)').show()
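For reference, here is a minimal self-contained sketch that reproduces the difference using a toy DataFrame; the data and session setup are assumptions for illustration, not the asker's actual parquet file:

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Assumes a fresh PySpark application; in the pyspark shell,
# sc already exists and this constructor call is not needed.
sc = SparkContext()
sqlCtx = SQLContext(sc)

# Toy stand-in for the parquet data: "name" is null on one row.
df = sqlCtx.createDataFrame([("x",), ("x",), (None,)], ["name"])
df.registerTempTable("type_table")

# count(name) skips rows where name is null, so the null group shows 0.
sqlCtx.sql("SELECT count(name), name FROM type_table GROUP BY name").show()

# count(*) counts whole rows, so the null group shows 1.
sqlCtx.sql("SELECT count(*), name FROM type_table GROUP BY name").show()

count(1) behaves the same way as count(*) here: both count rows, while count(col) counts only non-null values of that column.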

Upvotes: 3
