Reputation: 489
i have search a lot to find a solution to get the max id using group by of two columns as a pair in a data set but none of the queries i have found and used worked as expected. Below is an example data set:
id | tour_id | p1 | stage | rnd | assoc1 | p2 | assoc2 | winner |
---|---|---|---|---|---|---|---|---|
996057 | 5277 | 107028 | Main Draw | 32 | GER | 110673 | IRI | 107028 |
996101 | 5277 | 107028 | Main Draw | 16 | GER | 105136 | FRA | 107028 |
996126 | 5277 | 107028 | Main Draw | 8 | GER | 112074 | SWE | 107028 |
996133 | 5277 | 107028 | Main Draw | 4 | GER | 123980 | JPN | 107028 |
996139 | 5277 | 107028 | Main Draw | 2 | GER | 121582 | TPE | 107028 |
996037 | 5277 | 116620 | Main Draw | 32 | GER | 121582 | TPE | 121582 |
996037 | 5277 | 121582 | Main Draw | 32 | TPE | 116620 | GER | 121582 |
996097 | 5277 | 121582 | Main Draw | 16 | TPE | 104314 | IND | 121582 |
996121 | 5277 | 121582 | Main Draw | 8 | TPE | 112092 | NGR | 121582 |
996132 | 5277 | 121582 | Main Draw | 4 | TPE | 112062 | FRA | 121582 |
996139 | 5277 | 121582 | Main Draw | 2 | TPE | 107028 | GER | 107028 |
996324 | 5278 | 107028 | Main Draw | 32 | GER | 100439 | EGY | 107028 |
996362 | 5278 | 107028 | Main Draw | 16 | GER | 104314 | IND | 107028 |
996379 | 5278 | 107028 | Main Draw | 8 | GER | 116853 | SWE | 107028 |
996390 | 5278 | 107028 | Main Draw | 4 | GER | 123980 | JPN | 123980 |
996283 | 5278 | 116620 | Main Draw | 64 | GER | 121514 | KOR | 121514 |
996313 | 5278 | 121582 | Main Draw | 32 | TPE | 106296 | POR | 121582 |
996357 | 5278 | 121582 | Main Draw | 16 | TPE | 102968 | AUT | 121582 |
996380 | 5278 | 121582 | Main Draw | 8 | TPE | 102761 | GER | 102761 |
998765 | 5299 | 101222 | Main Draw | 64 | GER | 118671 | DEN | 101222 |
998788 | 5299 | 101222 | Main Draw | 32 | GER | 102380 | ENG | 101222 |
998801 | 5299 | 101222 | Main Draw | 16 | GER | 116620 | GER | 101222 |
998807 | 5299 | 101222 | Main Draw | 8 | GER | 116853 | SWE | 101222 |
998810 | 5299 | 101222 | Main Draw | 4 | GER | 112074 | SWE | 101222 |
998812 | 5299 | 101222 | Main Draw | 2 | GER | 107028 | GER | 101222 |
998773 | 5299 | 107028 | Main Draw | 64 | GER | 120168 | TUR | 107028 |
998797 | 5299 | 107028 | Main Draw | 32 | GER | 102891 | CRO | 107028 |
998805 | 5299 | 107028 | Main Draw | 16 | GER | 104379 | SWE | 107028 |
998809 | 5299 | 107028 | Main Draw | 8 | GER | 104036 | CZE | 107028 |
998811 | 5299 | 107028 | Main Draw | 4 | GER | 102841 | POR | 107028 |
998812 | 5299 | 107028 | Main Draw | 2 | GER | 101222 | GER | 101222 |
998757 | 5299 | 116620 | Main Draw | 64 | GER | 101192 | ITA | 116620 |
998794 | 5299 | 116620 | Main Draw | 32 | GER | 115449 | AUT | 116620 |
998801 | 5299 | 116620 | Main Draw | 16 | GER | 101222 | GER | 101222 |
What I would like to get is the following output which is basically the max(id) of the grouping of p1 and tour_id
id | tour_id | p1 | stage | rnd | assoc1 | p2 | assoc2 | winner |
---|---|---|---|---|---|---|---|---|
996139 | 5277 | 107028 | Main Draw | 2 | GER | 121582 | TPE | 107028 |
996037 | 5277 | 116620 | Main Draw | 32 | GER | 121582 | TPE | 121582 |
996139 | 5277 | 121582 | Main Draw | 2 | TPE | 107028 | GER | 107028 |
996390 | 5278 | 107028 | Main Draw | 4 | GER | 123980 | JPN | 123980 |
996283 | 5278 | 116620 | Main Draw | 64 | GER | 121514 | KOR | 121514 |
996380 | 5278 | 121582 | Main Draw | 8 | TPE | 102761 | GER | 102761 |
998812 | 5299 | 101222 | Main Draw | 2 | GER | 107028 | GER | 101222 |
998812 | 5299 | 107028 | Main Draw | 2 | GER | 101222 | GER | 101222 |
998801 | 5299 | 116620 | Main Draw | 16 | GER | 101222 | GER | 101222 |
Any help is appreciated.
Upvotes: 0
Views: 42
Reputation: 3490
Generally, I would take a simple query to get the max id for the conditions, and then either use it as a subquery or join, depending on the use case. Take a look at this fiddle:
I've inserted your data and then a series of queries. Here's the first one, just to get the maxID for each combination of tour_id and p1:
select tour_id, p1, max(id) as maxID
from t group by tour_id, p1;
which you can then use in a subquery to retrieve any rows that match those IDs like so:
select * from t
where id in (
select max(id)
from t group by tour_id, p1
);
or as a JOIN:
select t.* from t
join (
select max(id) as maxID
from t group by tour_id, p1
) ids on t.id = ids.maxID;
JOINs are usually more performant than IN for larger data sets, but that is not a hard and fast rule and the line really isn't well defined. I've included it here just for reference.
Now, these queries SHOULD be returning the same results, but it seems that the ID you're fetching the max value for isn't a unique ID, so they aren't, and it really depends on what you are trying to accomplish as to which answer is right. Here's one more option using window functions, which are really overkill for this, but let's look:
select tour_id, p1,
first_value(id) OVER (partition by tour_id, p1 order by id desc) as maxID,
first_value(stage) OVER (partition by tour_id, p1 order by id desc) as stage,
first_value(rnd) OVER (partition by tour_id, p1 order by id desc) as rnd,
first_value(assoc1) OVER (partition by tour_id, p1 order by id desc) as assoc1,
first_value(p2) OVER (partition by tour_id, p1 order by id desc) as p2,
first_value(assoc2) OVER (partition by tour_id, p1 order by id desc) as assoc2,
first_value(winner) OVER (partition by tour_id, p1 order by id desc) as winner
from t
Now this returns a LOT more rows, but a lot of them are duplicates, so let's add DISTINCT to just get the uniques:
select DISTINCT tour_id, p1,
first_value(id) OVER (partition by tour_id, p1 order by id desc) as maxID,
first_value(stage) OVER (partition by tour_id, p1 order by id desc) as stage,
first_value(rnd) OVER (partition by tour_id, p1 order by id desc) as rnd,
first_value(assoc1) OVER (partition by tour_id, p1 order by id desc) as assoc1,
first_value(p2) OVER (partition by tour_id, p1 order by id desc) as p2,
first_value(assoc2) OVER (partition by tour_id, p1 order by id desc) as assoc2,
first_value(winner) OVER (partition by tour_id, p1 order by id desc) as winner
from t
and now we're down to something that looks a little more like what you were after. For comparison, I have included the three queries side by side, ordered by id and with the columns all in the same order:
select DISTINCT
first_value(id) OVER (partition by tour_id, p1 order by id desc) as maxID,
tour_id, p1,
first_value(stage) OVER (partition by tour_id, p1 order by id desc) as stage,
first_value(rnd) OVER (partition by tour_id, p1 order by id desc) as rnd,
first_value(assoc1) OVER (partition by tour_id, p1 order by id desc) as assoc1,
first_value(p2) OVER (partition by tour_id, p1 order by id desc) as p2,
first_value(assoc2) OVER (partition by tour_id, p1 order by id desc) as assoc2,
first_value(winner) OVER (partition by tour_id, p1 order by id desc) as winner
from t order by 1;
select * from t
where id in (
select max(id)
from t group by tour_id, p1
) order by id;
select t.* from t
join (
select max(id) as maxID
from t group by tour_id, p1
) ids on t.id = ids.maxID
order by t.id;
The result set using the window functions seems to have the same output as you're looking for, but let me say that it seems like window functions are overkill for a case this simple, so I'm wondering if you need some unique ID instead. If you don't have a unique primary (autoincrementing) ID in your table(s), you should. It will save you a lot of headache at some point down the road. If you do, I wonder why we aren't using that instead of the non-unique one.
Let me know if this helps, or if anything is unclear.
Upvotes: 1