Reputation: 339
I have a table as follows:
create table issue_attributes (
issue_id number,
attr_timestamp timestamp,
attribute_name varchar2(500),
attribute_value varchar2(500),
CONSTRAINT ia-pk PRIMARY KEY (issue_id, attr_timestamp, attribute_name)
)
The idea here is to have a bunch of attributes (status, owner, etc) associated with an issue, while retaining the ability to keep a history of attribute changes.
Due to errors in data imports, we have repeated data in the table:
select issue_id, attr_timestamp, attribute_name, attribute_value
from issue_attributes where issue_id = 1 and attribute_name = 'OWNER';
yields sample data of:
1, 01-JAN-2011 12:00, 'OWNER', '[email protected]'
1, 01-FEB-2011 12:00, 'OWNER', '[email protected]'
1, 01-MAR-2011 12:00, 'OWNER', '[email protected]'
1, 01-APR-2011 12:00, 'OWNER', '[email protected]'
I would like to be able to find all instances of repeated attributes and just keep the latest one. The desired result set in this case of sample data would be:
1, 01-JAN-2011 12:00, 'OWNER', '[email protected]'
We could also have a case of the sample data being:
2, 01-JAN-2011 12:00, 'OWNER', '[email protected]'
2, 01-FEB-2011 12:00, 'OWNER', '[email protected]'
2, 01-MAR-2011 12:00, 'OWNER', '[email protected]'
2, 01-APR-2011 12:00, 'OWNER', '[email protected]'
in which case I'd like to have the result be:
2, 01-JAN-2011 12:00, 'OWNER', '[email protected]'
2, 01-FEB-2011 12:00, 'OWNER', '[email protected]'
2, 01-MAR-2011 12:00, 'OWNER', '[email protected]'
This is on Oracle 11g, so I could use SQL or PL/SQL to fix the data. I was thinking that one way to do it is via PL/SQL, and for each issue_id, descend sort the attributes, and if attribute(x) = attribute(x-1), then delete attribute(x). That seems a bit much like brute force, and I was wonder if there was an elegant way to do this just via SQL.
Upvotes: 1
Views: 96
Reputation: 2895
I would looking at the previous row and seeing if the data has changed. This can be done by using the LAG
analytic function.
You can look back at the previous value, ordering on the timestamp. If the data has changed, then you want to keep it. The first row is always kept because LAG
returns NULL
when there is no prior data.
with issue_attributes as (
select 1 as issue_id, date '2011-01-01' as attr_timestamp,
'OWNER' as attribute_name, '[email protected]' as attribute_value from dual union all
select 1 as issue_id, date '2011-02-01' as attr_timestamp,
'OWNER' as attribute_name, '[email protected]' as attribute_value from dual union all
select 1 as issue_id, date '2011-03-01' as attr_timestamp,
'OWNER' as attribute_name, '[email protected]' as attribute_value from dual union all
select 1 as issue_id, date '2011-04-01' as attr_timestamp,
'OWNER' as attribute_name, '[email protected]' as attribute_value from dual union all
select 2 as issue_id, date '2011-01-01' as attr_timestamp,
'OWNER' as attribute_name, '[email protected]' as attribute_value from dual union all
select 2 as issue_id, date '2011-02-01' as attr_timestamp,
'OWNER' as attribute_name, '[email protected]' as attribute_value from dual union all
select 2 as issue_id, date '2011-03-01' as attr_timestamp,
'OWNER' as attribute_name, '[email protected]' as attribute_value from dual union all
select 2 as issue_id, date '2011-04-01' as attr_timestamp,
'OWNER' as attribute_name, '[email protected]' as attribute_value from dual
)
select
issue_id,
attr_timestamp,
attribute_name,
attribute_value,
case when lag(attribute_value) over (partition by issue_id, attribute_name order by attr_timestamp) = attribute_value then null else 'Y'end as keep_value
from
issue_attributes
This will add an additional column to say whether the data needs to be kept or not, which you can then filter on:
ISSUE_ID ATTR_TIMESTAMP ATTRIBUTE_NAME ATTRIBUTE_VALUE KEEP_VALUE
1 01/01/2011 OWNER [email protected] Y
1 01/02/2011 OWNER [email protected]
1 01/03/2011 OWNER [email protected]
1 01/04/2011 OWNER [email protected]
2 01/01/2011 OWNER [email protected] Y
2 01/02/2011 OWNER [email protected] Y
2 01/03/2011 OWNER [email protected] Y
2 01/04/2011 OWNER [email protected]
Upvotes: 1
Reputation: 44250
What you want to detect is: tuples with the same {issueid, attributename, attributevalue}, but (when ordered on timestamp) no intervening tuples with the same{issueid, attributename} but a different {attributevalue}.
That could be written as a query with one EXISTS, and one NOT EXISTS subquery.
UPDATE :
SET search_path='tmp';
-- The rows you want to delete.
SELECT * FROM issue_attributes to_del
WHERE EXISTS (
SELECT * FROM issue_attributes xx
WHERE xx.issue_id = to_del.issue_id
AND xx.attribute_name = to_del.attribute_name
AND xx.attribute_value = to_del.attribute_value
AND xx.attr_timestamp > to_del.attr_timestamp
AND NOT EXISTS ( SELECT * FROM issue_attributes nx
WHERE nx.issue_id = to_del.issue_id
AND nx.attribute_name = to_del.attribute_name
AND nx.attribute_value <> to_del.attribute_value
AND nx.attr_timestamp > to_del.attr_timestamp
AND nx.attr_timestamp < xx.attr_timestamp
)
) ;
-- For completeness: the rows you want to keep.
SELECT * FROM issue_attributes must_stay
WHERE NOT EXISTS (
SELECT * FROM issue_attributes xx
WHERE xx.issue_id = must_stay.issue_id
AND xx.attribute_name = must_stay.attribute_name
AND xx.attribute_value = must_stay.attribute_value
AND xx.attr_timestamp > must_stay.attr_timestamp
AND NOT EXISTS ( SELECT * FROM issue_attributes nx
WHERE nx.issue_id = must_stay.issue_id
AND nx.attribute_name = must_stay.attribute_name
AND nx.attribute_value <> must_stay.attribute_value
AND nx.attr_timestamp > must_stay.attr_timestamp
AND nx.attr_timestamp < xx.attr_timestamp
)
) ;
The result:
issue_id | attr_timestamp | attribute_name | attribute_value
----------+---------------------+----------------+----------------------
1 | 2011-03-01 12:00:00 | OWNER | [email protected]
1 | 2011-01-01 12:00:00 | OWNER | [email protected]
1 | 2011-02-01 12:00:00 | OWNER | [email protected]
2 | 2011-03-01 12:00:00 | OWNER | [email protected]
(4 rows)
issue_id | attr_timestamp | attribute_name | attribute_value
----------+---------------------+----------------+------------------------
1 | 2011-04-01 12:00:00 | OWNER | [email protected]
2 | 2011-02-01 12:00:00 | OWNER | [email protected]
2 | 2011-04-01 12:00:00 | OWNER | [email protected]
2 | 2011-01-01 12:00:00 | OWNER | [email protected]
(4 rows)
Upvotes: 0
Reputation: 14403
Here's a nice "Oracle" way to do it.
Using your sample data:
SQL> desc issue_attributes
Name Null? Type
----------------------------------------------------------------- -------- --------------------------------------------
ISSUE_ID NUMBER
ATTR_TIMESTAMP TIMESTAMP(6)
ATTRIBUTE_NAME VARCHAR2(500)
ATTRIBUTE_VALUE VARCHAR2(500)
SQL> select * from issue_attributes;
ISSUE_ID ATTR_TIMESTAMP ATTRIBUTE_ ATTRIBUTE_VALUE
---------- ----------------------------------- ---------- ------------------------------
1 01-JAN-20 11.12.00.000000 AM OWNER [email protected]
1 01-FEB-20 11.12.00.000000 AM OWNER [email protected]
1 01-MAR-20 11.12.00.000000 AM OWNER [email protected]
1 01-APR-20 11.12.00.000000 AM OWNER [email protected]
1 01-JAN-20 11.12.00.000000 AM OWNER [email protected]
1 01-JAN-20 11.12.00.000000 AM OWNER [email protected]
1 01-FEB-20 11.12.00.000000 AM OWNER [email protected]
1 01-MAR-20 11.12.00.000000 AM OWNER [email protected]
1 01-APR-20 11.12.00.000000 AM OWNER [email protected]
1 01-JAN-20 11.12.00.000000 AM OWNER [email protected]
1 01-FEB-20 11.12.00.000000 AM OWNER [email protected]
1 01-MAR-20 11.12.00.000000 AM OWNER [email protected]
12 rows selected.
SQL> delete from issue_attributes
where rowid in(select rid
from (select rowid rid,
row_number() over (partition by ISSUE_ID,
ATTR_TIMESTAMP,
ATTRIBUTE_NAME,
ATTRIBUTE_VALUE
order by rowid) rn
from issue_attributes)
where rn<> 1);
7 rows deleted.
SQL> select * from issue_attributes;
ISSUE_ID ATTR_TIMESTAMP ATTRIBUTE_ ATTRIBUTE_VALUE
---------- ----------------------------------- ---------- ------------------------------
1 01-JAN-20 11.12.00.000000 AM OWNER [email protected]
1 01-FEB-20 11.12.00.000000 AM OWNER [email protected]
1 01-MAR-20 11.12.00.000000 AM OWNER [email protected]
1 01-APR-20 11.12.00.000000 AM OWNER [email protected]
1 01-FEB-20 11.12.00.000000 AM OWNER [email protected]
5 rows selected.
Hope that helps.
Upvotes: 1
Reputation: 541
I don't know about Oracle in particular, but something like
SELECT MAX(attr_timestamp), issue_id, attribute_name, attribute_value
FROM issue_attributes
GROUP BY issue_id, attribute_name, attribute_value
would produce a list of where every distinct triple issue_id, attribute_name, attribute_value
appears, along with the most recent timestamp, in a number of DBMS. Might be worth a try.
Upvotes: 0