Reputation: 22760
I need to join a single row to a table output based on an aggregate function, in this case of most recent corresponding record. Various other questions on this topic seem to work on the basis that both tables values are required (INNER JOIN
, etc.) but in my case the aggregate needs to work on a LEFT JOIN
table that is many times going to be NULL
.
MySQL 5.7; Here is an couple of illistrative tables:
Core_data:
-----------
create table `core`
(
core_id int(8) unsigned auto_increment primary key,
some_name varchar(80) null,
some_data varchar(80) null,
some_values .... etc.
)
Linked_data:
------------
create table `linked_data`
(
link_id smallint(6) unsigned auto_increment primary key,
core_id int(8) unsigned
data_date date,
some_linked_data_values varchar(80) null
)
I have a query dealing with dozens of tables. selecting 1 row from the Core table and selecting various LEFT JOIN
data from dozens of other tables.
The illustrated linked data table has data that is dated, and the date is important, that only the most recent is returned.
linked_data
------------
link_id | core_id | data_date | data_value | ...
-------------------------------------------------
1 | 2 | 2020-09-03 | something | ...
2 | 4 | 2019-07-29 | whatever | ...
3 | 1 | 2017-11-09 | yews | ...
4 | 4 | 2018-04-10 | socks | ...
I want to only join the row with core_id = 4 AND the maximum date value. How can I create this within the JOIN
scenario; I can't put the MAX
aggregate into the JOIN ... ON
condition.
My SQL is something like this:
SELECT ... many columns ...,
ld.data_value,
ld.data_date,
more.columns ...
FROM core
LEFT JOIN table1 ON core.core_id = table1.core_id
LEFT JOIN table2 ON core.core_id = table2.core_id
LEFT JOIN table3 ON core.core_id = table3.core_id
... etc ...
LEFT JOIN linked_data ld ON core.core_id = ld.core_id AND MAX(ld.data_date)
WHERE ... core_id = value
One table I need only a result row that has the highest value of a column (data based), there is no reason for linked_data
to hold any data so the LEFT JOIN may return NULL
For core_id = 4
I want to be able to output a single SQL row result containing linked_data.data_value = whatever
. For core_id = 5
I want to be able to output the rest of the data but nothing from linked_data.
table.
This answer is noted as correct but is also noted that it will become very slow very quickly with larger amounts of data.
This answer put the qualifier in the WHERE
clause, but there's no promise that linked_data
will contain any result at all so, I can of course add further conditionals (check if into the WHERE
clause here but I was hoping to avoid this.
This MySQL post has another possible solution but comments on this also state it is very slow (that may be user error on their part, I've not tested it yet).
I have also tried using a SELECT in the LEFT JOIN like so:
SELECT ... many columns ...,
ld.data_value,
ld.data_date,
more.columns ...
FROM core
LEFT JOIN table1 ON core.core_id = table1.core_id
LEFT JOIN table2 ON core.core_id = table2.core_id
LEFT JOIN table3 ON core.core_id = table3.core_id
... etc ...
LEFT JOIN (
SELECT linked_data FROM linked_data ldi WHERE core.core_id = ldi.core_id AND MAX(ldi.data_date)
) as ld ON core.core_id = ldi.core_id
WHERE ... core_id = value
Referenced from this Q&A
But this still tells me Aggregate calls are not allowed here
EDIT: I found why the aggregate wasn't allowed; a simple syntax mistake on my part; but I have put up a full answer to clarify this Q&A as I couldn't find any relative answers when I was searching, so this may be useful to someone.
If anyone has a more correct way of solving the original issue please share!
Upvotes: 1
Views: 1675
Reputation: 48139
In its shortest sample answer for you from what provided... You can have a pre-query resulting in an alias for the join. This pre-query can group / max per Core.
That said, and since the linked-data table is auto-increment, I would assume (yeah, I know about assume, but you can confirm) that as each record is added, the date will always be the date added. So and ID of 100 may have a date of Jan 14, 2020, you would never have an earlier date record with a higher ID such as ID 101 = Nov 3, 2019. As each ID added, higher date than last record regardless of the core id.
You can then continue your additional "left-joins" to other tables as needed.
REVISION FROM COMMENT CLARIFICATION
Martin, from the clarification you provided about the data coming from multiple sources and the date could represent older data, just revise the pre-query inner sql to the following. The query is heavily commented to clarify how the over/partition query works in this scenario
Now, integrating with all the rest of your stuff. I will only join to your primary core table as a left-join
select
cd.core_id,
cd.some_name,
cd.some_data,
cd.some_values,
ld2.link_id,
ld2.data_date,
ld2.some_linked_data_values
from
core_data cd
left join
( select pq1.*
from
-- first, all columns I want to have returned from
-- the linked_data table
( select core_id,
data_date,
link_id,
-- dense_rank() returns sequential counter value
-- starting at 1 based on every change of the
-- PARTITION BY Core_ID in next part below
dense_rank()
over ( partition by
-- primary sorting by the core_id
core_id
order by
-- then within each core_id, descending by date
data_date desc,
-- and then by the link_id descending, just in case
-- there are multiple records for the same core_id
-- AND the same date... So you get the most
-- recently added linked_data record for given core
link_id desc ) as sqlrow
from linked_data ) pq1
where
-- now, from inner partition/over query, only get the record
-- where the sqlrow = 1, as result of dense_rank()
-- that resets to 1 every time core_id changes
pq1.sqlrow=1 ) PQ
on cd.core_id = PQ.core_id
LEFT JOIN linked_data ld2
on PQ.Link_id = ld2.link_id
The inner query with OVER / PARTITION BY is basically making a first-pass of the data and ordering it first by the partition (the core id), then sub-sorted by the data_date DESCENDING (so most recent date first regardless of being added earlier or later from the import from whatever external sources), then sub-sorted by link_id descending based on the most recent record added for any given date.
The final outer WHERE clause is basically stating only give me back the first row for every core_id. So now, you have the proper critical elements to re-do the LEFT join back to the original core ID, yet have the proper link_id to get the proper record at your final query result.
Upvotes: 1
Reputation: 22760
After writing out this whole question, I found that a simple syntax fix on my latest example SQL resolved the correct way to do this (and I've done this in the past the correct way but had not learnt it).
I note the lack of (or rather my inability to easily find) a clear definitive answer on the Interwebs to my query, so here's my methodology...
The correct way to do this type of aggregated join with respect to getting 0 or 1 results from the joined table is as follows:
LEFT JOIN
and then wrap the subquery in brackets and set it as ...
as per usual.ON ...
section.ORDER BY
in the sub query and then set the aggregate at that point.So;
SELECT ... many columns ...,
ld.data_value,
ld.data_date,
more.columns ...
FROM core
LEFT JOIN table1 ON core.core_id = table1.core_id
LEFT JOIN table2 ON core.core_id = table2.core_id
LEFT JOIN table3 ON core.core_id = table3.core_id
... etc ...
LEFT JOIN (
SELECT linked_data FROM linked_data ldi WHERE optional = 1 ORDER BY MAX(ldi.data_date)
) as ld ON core.core_id = ldi.core_id
WHERE ... core_id = value
There is no need for a WHERE
in the subquery but obviously if you do then it should be static and not need to reference the outer table, as that's done on the ON
clause. I put in WHERE optional = 1
for familiarity.
The nature of LEFT JOIN
is it will return 0 or 1 results so I find there's no need for much else.
If anyone has a better way of solving my original issue please let me know!
P.s> I have not efficiency tested this answer at all.
Upvotes: 0