yellowskull
yellowskull

Reputation: 177

Postgres, Rails and selecting columns that are not in group clause

I have the following query in which I want to group by treatment_selections.treatment_id and select the treatments.name column to be called:

@search = Trial.joins(:quality_datum, treatment_selections: :treatment)
.select('DISTINCT ON (treatment_selections.treatment_id) treatment_selections.treatment_id, treatments.name, AVG(quality_data.yield) as yield')
.where("EXTRACT(year from season_year) BETWEEN #{params[:start_year]} AND #{params[:end_year]}")

I get the dreaded error:

PG::GroupingError: ERROR:  column "treatment_selections.treatment_id" must appear in the GROUP BY clause or be used in an aggregate function

So I switched to the following query:

@search = Trial.joins(:quality_datum, treatment_selections: :treatment)
.select('treatments.name, treatment_selections.treatment_id, treatments.name, AVG(quality_data.yield) as yield')
.where("EXTRACT(year from season_year) BETWEEN #{params[:start_year]} AND #{params[:end_year]}")  
.group('treatment_selections.treatment_id')

Which I know won't work because of not referencing treatments.name in the group clause. But I figured the top method should of worked as I'm not grouping by anything. I understand that using such methods as AVG and SUM are not needed to be referenced in the group clause, but what about columns that don't reference any aggregate functions?

I have seen that nesting queries is a possible way of doing what I'm after, but I'm unsure of how best to implement this using the above query. Hoping someone could help me out here.

Log

SELECT treatment_selections.treatment_id, treatment.name, AVG(quality_data.yield) as yield FROM "trials" INNER JOIN "treatment_selections" ON "treatment_selections"."trial_id" = "trials"."id" INNER JOIN "quality_data" ON "quality_data"."treatment_selection_id" = "treatment_selections"."id" INNER JOIN "treatment_selections" "treatment_selections_trials" ON "treatment_selections_trials"."trial_id" = "trials"."id" INNER JOIN "treatments" ON "treatments"."id" = "treatment_selections_trials"."treatment_id" WHERE (EXTRACT(year from season_year) BETWEEN 2018 AND 2018) GROUP BY treatment_selections.treatment_id)

Upvotes: 4

Views: 734

Answers (2)

Krauss
Krauss

Reputation: 1093

I am not expert on Rails but lets analyze the logged query:

SELECT treatment_selections.treatment_id, treatment.name, AVG(quality_data.yield) as yield
FROM "trials"
INNER JOIN "treatment_selections" ON "treatment_selections"."trial_id" = "trials"."id"
INNER JOIN "quality_data" ON "quality_data"."treatment_selection_id" = "treatment_selections"."id"
INNER JOIN "treatment_selections" "treatment_selections_trials" ON "treatment_selections_trials"."trial_id" = "trials"."id"
INNER JOIN "treatments" ON "treatments"."id" = "treatment_selections_trials"."treatment_id"
WHERE (EXTRACT(year from season_year) BETWEEN 2018 AND 2018)
GROUP BY treatment_selections.treatment_id

Maybe you are relying in the clause DISTINCT ON to make this work without specifying both columns. But as you see in the log, this is not being translated into SQL.

SELECT [missing DISTINCT ON(treatment_selections.treatment_id)] treatment_selections.treatment_id, treatment.name, AVG(quality_data.yield) as yield
FROM "trials"
INNER JOIN "treatment_selections" ON "treatment_selections"."trial_id" = "trials"."id"
INNER JOIN "quality_data" ON "quality_data"."treatment_selection_id" = "treatment_selections"."id"
INNER JOIN "treatment_selections" "treatment_selections_trials" ON "treatment_selections_trials"."trial_id" = "trials"."id"
INNER JOIN "treatments" ON "treatments"."id" = "treatment_selections_trials"."treatment_id"
WHERE (EXTRACT(year from season_year) BETWEEN 2018 AND 2018)
GROUP BY treatment_selections.treatment_id

But even if you managed to force Rails to implement DISTINCT ON, you might not get your intended result because DISTINCT ON should return only one row per treatment_id.

The standard SQL way is to specify both columns as grouping in the aggregation:

If it is the case that treatment_id has a 1:1 relationship to treatment_name, then if you run the query without the AVG function (and without DISTINCT ON), the data would look similar to:

|   treatment_id    |       name          |  yield    |  
------------------------------------------------------
|        1          |   treatment 1       |    0.50   |
|        1          |   treatment 1       |    0.45   |
|        2          |   treatment 2       |    0.65   |
|        2          |   treatment 2       |    0.66   |
|        3          |   treatment 3       |    0.85   |

Now to use the average function you must aggregate by (both) treatment_id and treatment_name.

The reason you must specify both is because the database manager assumes that all the columns in the resulting data set are not related among each other. So, aggregating by both columns

SELECT treatment_selections.treatment_id, treatments.name, AVG(quality_data.yield) as yield
FROM "trials"
INNER JOIN "treatment_selections" ON "treatment_selections"."trial_id" = "trials"."id"
INNER JOIN "quality_data" ON "quality_data"."treatment_selection_id" = "treatment_selections"."id"
INNER JOIN "treatment_selections" "treatment_selections_trials" ON "treatment_selections_trials"."trial_id" = "trials"."id"
INNER JOIN "treatments" ON "treatments"."id" = "treatment_selections_trials"."treatment_id"
WHERE (EXTRACT(year from season_year) BETWEEN 2018 AND 2018)
GROUP BY treatment_selections.treatment_id, treatments.name

will give you the following result:

|   treatment_id    |       name          |   AVG(yield)   |  
------------------------------------------------------------
|        1          |   treatment 1       |      0.475     |
|        2          |   treatment 2       |      0.655     |
|        3          |   treatment 3       |      0.85      |

To understand this better, if the resulting data in the first two columns was not related; for example:

|   year    |       name          |   yield   |  
-----------------------------------------------
|    2000   |   treatment 1       |    0.1    |
|    2000   |   treatment 1       |    0.2    |
|    2000   |   treatment 2       |    0.3    |
|    2000   |   treatment 3       |    0.4    |
|    2001   |   treatment 2       |    0.5    |
|    2001   |   treatment 3       |    0.6    |
|    2002   |   treatment 3       |    0.7    |

you must still group by year and name and, in this case, the average function would only be used when year and name are the same (note that it is not possible to do otherwise) resulting:

|   year    |       name          |   AVG(yield)   |  
---------------------------------------------------
|    2000   |   treatment 1       |     0.15       |
|    2000   |   treatment 2       |     0.3        |
|    2000   |   treatment 3       |     0.4        |
|    2001   |   treatment 2       |     0.5        |
|    2001   |   treatment 3       |     0.6        |
|    2002   |   treatment 3       |     0.7        |

Upvotes: 0

Jeechu Deka
Jeechu Deka

Reputation: 364

Selecting multiple columns (without aggregation) and using aggregate functions together won't be possible, unless you group by the selected columns - otherwise there is no way to determine how the average should be calculated (entire data set vs grouped by something). You could do this -

@search = Trial.joins(:quality_datum, treatment_selections: :treatment)
.select('treatment_selections.treatment_id, treatments.name, AVG(quality_data.yield) as yield')
.where("EXTRACT(year from season_year) BETWEEN ? AND ?", params[:start_year], params[:end_year])  
.group('treatment_selections.treatment_id, treatments.name')

Although this might not work well for your use case if one treatments.id can be associated with mutiple treatment.name

Upvotes: 2

Related Questions