Viv

Reputation: 1584

The Spark SQL version of a query fails, whereas the normal SQL query works

The normal SQL query:

    SELECT DISTINCT(county_geoid), state_geoid, sum(PredResponse), sum(prop_count) FROM table_a GROUP BY county_geoid;

gives me an output. However, the Spark SQL version of the same query, used in PySpark, gives me an error:

    result_county_performance_alpha = spark.sql("SELECT distinct(county_geoid), sum(PredResponse), sum(prop_count), state_geoid FROM table_a group by county_geoid")

This produces the following error:

AnalysisException: u"expression 'tract_alpha.`state_geoid`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;

How do I resolve this?

Upvotes: 0

Views: 383

Answers (1)

Gordon Linoff

Reputation: 1270993

Your "normal" query should not work anywhere. The correct way to write the query is:

    SELECT county_geoid, state_geoid, sum(PredResponse), sum(prop_count)
    FROM table_a
    GROUP BY county_geoid, state_geoid;

This should work on any database (where the columns and tables are defined and of the right types).

Your version has state_geoid in the SELECT, but it is not being aggregated. That is not correct SQL. It might happen to work in MySQL, but that is due to a (mis)feature in the database (that is finally being fixed).

Also, you almost never want to use SELECT DISTINCT with GROUP BY. And the parentheses after DISTINCT make no difference: the construct is SELECT DISTINCT, and DISTINCT is not a function.

Upvotes: 3
