Reputation: 57
I'm trying to compute the average for each row in a table that has the same postcode and group it by that postcode and the year. I'm trying to run the following query
INSERT INTO processed_clean_properties (postcode,avgYearPostcodeNorm,latitude,longitude,yearSold)
SELECT postcode, round(avg(norm)), latitude, longitude, yearSold
FROM clean_properties
GROUP BY postcode, yearSold
and getting the following error
" Expression #3 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'forge.clean_properties.latitude' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by"
I've looked into it and tried to disable the only_full_group_by from the sql_mode but it doesn't seem t o save any if the server is restarted it resets to default.
I've also tried adding all the selected columns to the group by condition as such
INSERT INTO processed_clean_properties (postcode,avgYearPostcodeNorm,latitude,longitude,yearSold)
SELECT postcode, round(avg(norm)), latitude, longitude, yearSold
FROM clean_properties
GROUP BY postcode, norm, latitude, longitude, yearSold
Doing this makes the query run indefinitely without actually doing anything.
How can i correct the initial query to work with the full_group_by condition ?
Upvotes: 1
Views: 3143
Reputation: 179174
From the documentation:
To tell MySQL to accept the query, you can use the
ANY_VALUE()
function.https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html
Add ANY_VALUE()
to the non-aggregated columns. For example, ANY_VALUE(latitude) AS latitude
.
You're bumping the fact that in MySQL 5.7 an old optimization -- allowing the server to nondeterministically return any one of the values from one row in each group for non-aggregated columns -- no longer works by default. The old group by optimization is technically not valid SQL -- even though when used correctly, it's a huge performance winner. Using ANY_VALUE()
enables the old behavior while making it explicit and obvious that you are asking the server to trust you that you know what you're doing, that these columns are indeed functionally dependent on the group by, so returning any value from the group is fine since the should all be the same.
If they are not all the same within each group, of course, then your query is logically flawed.
Upvotes: 6
Reputation: 1270181
If including latitude
and longitude
in the group by
cause the query to run forever, then this will probably have the same effect:
INSERT INTO processed_clean_properties (postcode, avgYearPostcodeNorm, latitude, longitude, yearSold)
SELECT postcode, round(avg(norm)),
avg(latitude), avg(longitude), yearSold
FROM clean_properties
GROUP BY postcode, yearSold;
This gets the average of latitude
and longitude
for the rows for the postcode. This isn't exact, but it is probably no worse than just grabbing arbitrary values.
Upvotes: 1
Reputation: 77896
You can as well consider getting the grouping first and then perform a join
SELECT cp.latitude, cp.longitude, x.postcode, x.avg_norm, x.yearSold
FROM clean_properties cp JOIN (
SELECT postcode, round(avg(norm)) as avg_norm, yearSold
FROM clean_properties
GROUP BY postcode, yearSold ) x ON cp.postcode = x.postcode;
Upvotes: 0
Reputation: 6832
I think all you need to do is remove the aggregated column norm
from your GROUP BY:
INSERT INTO processed_clean_properties (postcode,avgYearPostcodeNorm,latitude,longitude,yearSold)
SELECT postcode, round(avg(norm)), latitude, longitude, yearSold
FROM clean_properties
GROUP BY postcode, latitude, longitude, yearSold
If this still runs forever, it just means that grouping by all those additional columns is something that takes much longer. To resolve this you will need to tell us more on the table structure and more importantly, post the explain plan.
Another option, to set only_full_group_by
off for good, you will need to set it in the my.cnf file. This file holds configurations that are used during server start up.
Upvotes: 2
Reputation: 34252
The 2nd attempt is almost correct, just remove the norm
field from the group by list, since you do use an aggregate function on it.
INSERT INTO processed_clean_properties (postcode,avgYearPostcodeNorm,latitude,longitude,yearSold)
SELECT postcode, round(avg(norm)), latitude, longitude, yearSold FROM clean_properties GROUP BY postcode, latitude, longitude, yearSold
If the above query is still slow, then you have to consider adding a multi-column index on the fields in the group by.
Upvotes: 1