Reputation: 1584
The normal (non-Spark) SQL query:
SELECT DISTINCT(county_geoid), state_geoid, sum(PredResponse), sum(prop_count) FROM table_a GROUP BY county_geoid;
gives me the expected output. However, the Spark SQL version of the same query, run from PySpark, fails with an error. How can I resolve this?
result_county_performance_alpha = spark.sql("SELECT distinct(county_geoid), sum(PredResponse), sum(prop_count), state_geoid FROM table_a group by county_geoid")
This gives an error:
AnalysisException: u"expression 'tract_alpha.`state_geoid`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
How can I resolve this?
Upvotes: 0
Views: 383
Reputation: 1270993
Your "normal" query should not work anywhere. The correct way to write the query is:
SELECT county_geoid, state_geoid, sum(PredResponse), sum(prop_count)
FROM table_a
GROUP BY county_geoid, state_geoid;
This should work on any database (where the columns and tables are defined and of the right types).
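As a quick, self-contained illustration of the corrected shape, here is a sketch using Python's built-in sqlite3 instead of Spark (the sample rows and geoid values are made up; the table and column names follow the question). The same SQL text works unchanged in `spark.sql()`:

```python
import sqlite3

# In-memory database with a table shaped like the question's table_a.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_a (county_geoid TEXT, state_geoid TEXT, "
    "PredResponse REAL, prop_count INTEGER)"
)
conn.executemany(
    "INSERT INTO table_a VALUES (?, ?, ?, ?)",
    [
        ("06001", "06", 0.5, 10),   # two rows for the same county...
        ("06001", "06", 0.25, 5),   # ...that should be summed together
        ("36061", "36", 1.0, 7),
    ],
)

# Every non-aggregated SELECT column also appears in GROUP BY.
rows = conn.execute(
    "SELECT county_geoid, state_geoid, SUM(PredResponse), SUM(prop_count) "
    "FROM table_a GROUP BY county_geoid, state_geoid "
    "ORDER BY county_geoid"
).fetchall()
print(rows)
# [('06001', '06', 0.75, 15), ('36061', '36', 1.0, 7)]
```

Each `(county_geoid, state_geoid)` pair comes out once, with its sums, which is what the original query was trying to achieve.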
Your version has state_geoid in the SELECT, but it is not being aggregated. That is not correct SQL. It might happen to work in MySQL, but that is due to a (mis)feature in the database (one that is finally being fixed).
Also, you almost never want to use SELECT DISTINCT with GROUP BY. And the parentheses after DISTINCT make no difference: the construct is SELECT DISTINCT. DISTINCT is not a function.
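To see that the parentheses are purely cosmetic, here is a small sqlite3 sketch (the table `t` and its rows are made up for illustration): DISTINCT is a set quantifier applied to the whole result row, not a function applied to one column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT, b TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("x", "1"), ("x", "1"), ("x", "2")],  # one duplicate row
)

# 'DISTINCT(a), b' parses as 'DISTINCT (a), b' -- same as 'DISTINCT a, b'.
with_parens = conn.execute("SELECT DISTINCT(a), b FROM t ORDER BY a, b").fetchall()
without = conn.execute("SELECT DISTINCT a, b FROM t ORDER BY a, b").fetchall()
print(with_parens == without)  # True -- the parentheses change nothing
print(with_parens)             # [('x', '1'), ('x', '2')]
```

In both cases deduplication happens over the full `(a, b)` row, which is why writing DISTINCT(county_geoid) in the original query did not restrict anything to that one column.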
Upvotes: 3