I have a table which stores predictions from a machine learning model. This is a model that each hour ("predicted_at") predicts a value for the next 24 hours("predicted_for"). This means that the table have many different values for each "id" and "predicted_for". Example of how the the table looks like for one ID and one predicted_for timestamp: value id predicted_at predicted_for 0.1825819489860161 6970631400382957 2021-08-21 16:00:00 UTC 2021-08-21 23:00:00 UTC 0.14882256844401498 6970631400382957 2021-08-21 17:00:00 UTC 2021-08-20 23:00:00 UTC 0.17357749613149909 6970631400382957 2021-08-21 17:00:00 UTC 2021-08-21 23:00:00 UTC 0.20283864055714163 6970631400382957 2021-08-21 18:00:00 UTC 2021-08-21 23:00:00 UTC 0.18582932551434195 6970631400382957 2021-08-21 19:00:00 UTC 2021-08-21 23:00:00 UTC 0.18859835855398877 6970631400382957 2021-08-21 20:00:00 UTC 2021-08-21 23:00:00 UTC 0.15969341546283378 6970631400382957 2021-08-21 21:00:00 UTC 2021-08-21 23:00:00 UTC 0.21578765348925422 6970631400382957 2021-08-21 22:00:00 UTC 2021-08-21 23:00:00 UTC What I want to do is to query this data so that I get, for each ID, only one predicted value for each "predicted_for" timestemp, and I want this value to be "predicted_at" at "predicted_for - t(2) hours". For the example, the result would give me the following table: value id predicted_at predicted_for 0.15969341546283378 6970631400382957 2021-08-21 21:00:00 UTC 2021-08-21 23:00:00 UTC I assume I would have some kind of group by and a having clause, but I have not figured out how to solve it. Anyone have an idea?

Reputation: 311

Query table with multiple "duplicates", getting the most recent

I have a table which stores predictions from a machine learning model. This is a model that each hour ("predicted_at") predicts a value for the next 24 hours("predicted_for"). This means that the table have many different values for each "id" and "predicted_for".

Example of how the the table looks like for one ID and one predicted_for timestamp:

value	id	predicted_at	predicted_for
0.1825819489860161	6970631400382957	2021-08-21 16:00:00 UTC	2021-08-21 23:00:00 UTC
0.14882256844401498	6970631400382957	2021-08-21 17:00:00 UTC	2021-08-20 23:00:00 UTC
0.17357749613149909	6970631400382957	2021-08-21 17:00:00 UTC	2021-08-21 23:00:00 UTC
0.20283864055714163	6970631400382957	2021-08-21 18:00:00 UTC	2021-08-21 23:00:00 UTC
0.18582932551434195	6970631400382957	2021-08-21 19:00:00 UTC	2021-08-21 23:00:00 UTC
0.18859835855398877	6970631400382957	2021-08-21 20:00:00 UTC	2021-08-21 23:00:00 UTC
0.15969341546283378	6970631400382957	2021-08-21 21:00:00 UTC	2021-08-21 23:00:00 UTC
0.21578765348925422	6970631400382957	2021-08-21 22:00:00 UTC	2021-08-21 23:00:00 UTC

What I want to do is to query this data so that I get, for each ID, only one predicted value for each "predicted_for" timestemp, and I want this value to be "predicted_at" at "predicted_for - t(2) hours".

For the example, the result would give me the following table:

value	id	predicted_at	predicted_for
0.15969341546283378	6970631400382957	2021-08-21 21:00:00 UTC	2021-08-21 23:00:00 UTC

I assume I would have some kind of group by and a having clause, but I have not figured out how to solve it.

Anyone have an idea?

Upvotes: 1

Answers (3)

Mikhail Berlyant

Reputation: 173190

Consider below approach

select any_value(value) value, id, 
  any_value(predicted_at) predicted_at, predicted_for,   
from data t
where predicted_at = predicted_for - interval 2 hour
group by id, predicted_for

if applied to sample data in your question - output is

Upvotes: 1

Gregorio Palamà

Reputation: 1952

Your query should look like this one:

SELECT * 
FROM `YOURTABLE`
WHERE DATETIME_DIFF(predicted_for, predicted_at, HOUR) = 2

Note that it is not ensuring that you only have one record that matches the rule of having predicted_at preceding by 2 hours the value in predicted_for. To also ensure this, the query should be like this:

SELECT * 
FROM `YOURTABLE` tab
WHERE predicted_at = (
    SELECT MAX(predicted_at)
    FROM `YOURTABLE` tab2
    WHERE DATETIME_DIFF(tab2.predicted_for, tab2.predicted_at, HOUR) = 2
    AND tab2.id = tab.id
)

Upvotes: 0

Robbi

Reputation: 1

do a query like:

SELECT p.* FROM
  ( SELECT max(predicted_at) as predicted_at_max 
    FROM youraboveschowedtable 
    GROUP BY id ) p1,
DATEADD(hour, -2, predicted_at) as predicted_minus2
JOIN youraboveschowedtable p
ON p.predicted_at = p1.predicted_at_max

litle altert Source = SQL query return duplicate with most recent date and more

Upvotes: 0

Query table with multiple &quot;duplicates&quot;, getting the most recent

Answers (3)

Related Questions

Query table with multiple "duplicates", getting the most recent