Reputation: 57

Group by contains nonaggregated column

I'm trying to compute the average for each row in a table that has the same postcode and group it by that postcode and the year. I'm trying to run the following query

INSERT INTO processed_clean_properties (postcode,avgYearPostcodeNorm,latitude,longitude,yearSold)
SELECT postcode, round(avg(norm)), latitude, longitude, yearSold 
FROM clean_properties 
GROUP BY postcode, yearSold

and getting the following error

" Expression #3 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'forge.clean_properties.latitude' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by"

I've looked into it and tried to disable the only_full_group_by from the sql_mode but it doesn't seem t o save any if the server is restarted it resets to default.

I've also tried adding all the selected columns to the group by condition as such

INSERT INTO processed_clean_properties (postcode,avgYearPostcodeNorm,latitude,longitude,yearSold)
SELECT postcode, round(avg(norm)), latitude, longitude, yearSold 
FROM clean_properties 
GROUP BY postcode, norm, latitude, longitude, yearSold

Doing this makes the query run indefinitely without actually doing anything.

How can i correct the initial query to work with the full_group_by condition ?

Upvotes: 1

Answers (5)

Michael - sqlbot

Reputation: 179174

From the documentation:

To tell MySQL to accept the query, you can use the ANY_VALUE() function.

https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html

Add ANY_VALUE() to the non-aggregated columns. For example, ANY_VALUE(latitude) AS latitude.

You're bumping the fact that in MySQL 5.7 an old optimization -- allowing the server to nondeterministically return any one of the values from one row in each group for non-aggregated columns -- no longer works by default. The old group by optimization is technically not valid SQL -- even though when used correctly, it's a huge performance winner. Using ANY_VALUE() enables the old behavior while making it explicit and obvious that you are asking the server to trust you that you know what you're doing, that these columns are indeed functionally dependent on the group by, so returning any value from the group is fine since the should all be the same.

If they are not all the same within each group, of course, then your query is logically flawed.

Upvotes: 6

Gordon Linoff

Reputation: 1270181

If including latitude and longitude in the group by cause the query to run forever, then this will probably have the same effect:

INSERT INTO processed_clean_properties (postcode, avgYearPostcodeNorm, latitude, longitude, yearSold)
    SELECT postcode, round(avg(norm)),
           avg(latitude), avg(longitude), yearSold 
    FROM clean_properties 
    GROUP BY postcode, yearSold;

This gets the average of latitude and longitude for the rows for the postcode. This isn't exact, but it is probably no worse than just grabbing arbitrary values.

Upvotes: 1

Rahul

Reputation: 77896

You can as well consider getting the grouping first and then perform a join

SELECT cp.latitude, cp.longitude, x.postcode, x.avg_norm, x.yearSold
FROM clean_properties cp JOIN (
SELECT postcode, round(avg(norm)) as avg_norm, yearSold 
FROM clean_properties 
GROUP BY postcode, yearSold ) x ON cp.postcode = x.postcode;

Upvotes: 0

Galz

Reputation: 6832

I think all you need to do is remove the aggregated column norm from your GROUP BY:

INSERT INTO processed_clean_properties (postcode,avgYearPostcodeNorm,latitude,longitude,yearSold)
SELECT postcode, round(avg(norm)), latitude, longitude, yearSold 
FROM clean_properties 
GROUP BY postcode, latitude, longitude, yearSold

If this still runs forever, it just means that grouping by all those additional columns is something that takes much longer. To resolve this you will need to tell us more on the table structure and more importantly, post the explain plan.

Another option, to set only_full_group_by off for good, you will need to set it in the my.cnf file. This file holds configurations that are used during server start up.

Upvotes: 2

Shadow

Reputation: 34252

The 2nd attempt is almost correct, just remove the norm field from the group by list, since you do use an aggregate function on it.

INSERT INTO processed_clean_properties (postcode,avgYearPostcodeNorm,latitude,longitude,yearSold)
SELECT postcode, round(avg(norm)), latitude, longitude, yearSold FROM clean_properties GROUP BY postcode, latitude, longitude, yearSold

If the above query is still slow, then you have to consider adding a multi-column index on the fields in the group by.

Upvotes: 1

Group by contains nonaggregated column

Answers (5)

Related Questions