Reputation: 13118
I load parquet file into sql context like this:
sqlCtx = SQLContext(sc)
rdd_file = sqlCtx.read.parquet("hdfs:///my_file.parquet")
rdd_file.registerTempTable("type_table")
Then I run this simple query:
sqlCtx.sql('SELECT count(name), name from type_table group by name order by count(name)').show()
The result:
+----------------+----------+
|count(name) |name |
+----------------+----------+
| 0| null|
| 226307| x|
+----------------+----------+
However, if I use groupBy from rdd set. I got a different result:
sqlCtx.sql("SELECT name FROM type_table").groupBy("name").count().show()
+----------+------+
| name | count|
+----------+------+
| x|226307|
| null|586822|
+----------+------+
The count of x is the same for the two methods but null is quite different. It seems like the sql statement doesn't count null with group by correctly. Can you point out what I did wrong?
Thanks,
Upvotes: 1
Views: 1182
Reputation: 2089
count(name) will exclude null values , if you give count(*) it will give you the null values as well .
Try below.
sqlCtx.sql('SELECT count(*), name from type_table group by name order by count(*)').show()
Upvotes: 3