Reputation: 339

How to find repeated, timestamped data

I have a table as follows:

create table issue_attributes (
  issue_id number,
  attr_timestamp timestamp,
  attribute_name varchar2(500),
  attribute_value varchar2(500),
  CONSTRAINT ia-pk PRIMARY KEY (issue_id, attr_timestamp, attribute_name)
)

The idea here is to have a bunch of attributes (status, owner, etc) associated with an issue, while retaining the ability to keep a history of attribute changes.

Due to errors in data imports, we have repeated data in the table:

select issue_id, attr_timestamp, attribute_name, attribute_value
from issue_attributes where issue_id = 1 and attribute_name = 'OWNER';

yields sample data of:

1, 01-JAN-2011 12:00, 'OWNER', '[email protected]'
1, 01-FEB-2011 12:00, 'OWNER', '[email protected]'
1, 01-MAR-2011 12:00, 'OWNER', '[email protected]'
1, 01-APR-2011 12:00, 'OWNER', '[email protected]'

I would like to be able to find all instances of repeated attributes and just keep the latest one. The desired result set in this case of sample data would be:

1, 01-JAN-2011 12:00, 'OWNER', '[email protected]'

We could also have a case of the sample data being:

2, 01-JAN-2011 12:00, 'OWNER', '[email protected]'
2, 01-FEB-2011 12:00, 'OWNER', '[email protected]'
2, 01-MAR-2011 12:00, 'OWNER', '[email protected]'
2, 01-APR-2011 12:00, 'OWNER', '[email protected]'

in which case I'd like to have the result be:

2, 01-JAN-2011 12:00, 'OWNER', '[email protected]'
2, 01-FEB-2011 12:00, 'OWNER', '[email protected]'
2, 01-MAR-2011 12:00, 'OWNER', '[email protected]'

This is on Oracle 11g, so I could use SQL or PL/SQL to fix the data. I was thinking that one way to do it is via PL/SQL, and for each issue_id, descend sort the attributes, and if attribute(x) = attribute(x-1), then delete attribute(x). That seems a bit much like brute force, and I was wonder if there was an elegant way to do this just via SQL.

Upvotes: 1

Answers (4)

Mike Meyers

Reputation: 2895

I would looking at the previous row and seeing if the data has changed. This can be done by using the LAG analytic function.

You can look back at the previous value, ordering on the timestamp. If the data has changed, then you want to keep it. The first row is always kept because LAG returns NULL when there is no prior data.

with issue_attributes as (
  select 1 as issue_id, date '2011-01-01' as attr_timestamp, 
    'OWNER' as attribute_name, '[email protected]' as attribute_value from dual union all
  select 1 as issue_id, date '2011-02-01' as attr_timestamp, 
    'OWNER' as attribute_name, '[email protected]' as attribute_value from dual union all
  select 1 as issue_id, date '2011-03-01' as attr_timestamp, 
    'OWNER' as attribute_name, '[email protected]' as attribute_value from dual union all
  select 1 as issue_id, date '2011-04-01' as attr_timestamp, 
    'OWNER' as attribute_name, '[email protected]' as attribute_value from dual union all
  select 2 as issue_id, date '2011-01-01' as attr_timestamp, 
    'OWNER' as attribute_name, '[email protected]' as attribute_value from dual union all
  select 2 as issue_id, date '2011-02-01' as attr_timestamp, 
    'OWNER' as attribute_name, '[email protected]' as attribute_value from dual union all
  select 2 as issue_id, date '2011-03-01' as attr_timestamp, 
    'OWNER' as attribute_name, '[email protected]' as attribute_value from dual union all
  select 2 as issue_id, date '2011-04-01' as attr_timestamp, 
    'OWNER' as attribute_name, '[email protected]' as attribute_value from dual 
)
select 
  issue_id, 
  attr_timestamp, 
  attribute_name, 
  attribute_value,
  case when lag(attribute_value) over (partition by issue_id, attribute_name order by attr_timestamp) = attribute_value then null else 'Y'end as keep_value
from 
  issue_attributes

This will add an additional column to say whether the data needs to be kept or not, which you can then filter on:

ISSUE_ID ATTR_TIMESTAMP ATTRIBUTE_NAME ATTRIBUTE_VALUE        KEEP_VALUE
1        01/01/2011     OWNER          [email protected]   Y
1        01/02/2011     OWNER          [email protected] 
1        01/03/2011     OWNER          [email protected] 
1        01/04/2011     OWNER          [email protected] 
2        01/01/2011     OWNER          [email protected]   Y
2        01/02/2011     OWNER          [email protected] Y
2        01/03/2011     OWNER          [email protected]   Y
2        01/04/2011     OWNER          [email protected]

Upvotes: 1

wildplasser

Reputation: 44250

What you want to detect is: tuples with the same {issueid, attributename, attributevalue}, but (when ordered on timestamp) no intervening tuples with the same{issueid, attributename} but a different {attributevalue}.

That could be written as a query with one EXISTS, and one NOT EXISTS subquery.

UPDATE :

SET search_path='tmp';

-- The rows you want to delete.
SELECT * FROM issue_attributes to_del
WHERE EXISTS (
    SELECT * FROM issue_attributes xx
    WHERE xx.issue_id = to_del.issue_id
    AND xx.attribute_name = to_del.attribute_name
    AND xx.attribute_value = to_del.attribute_value
    AND xx.attr_timestamp > to_del.attr_timestamp
    AND NOT EXISTS ( SELECT * FROM issue_attributes nx
        WHERE nx.issue_id = to_del.issue_id
        AND nx.attribute_name = to_del.attribute_name
        AND nx.attribute_value <> to_del.attribute_value
        AND nx.attr_timestamp > to_del.attr_timestamp
        AND nx.attr_timestamp < xx.attr_timestamp
        )   
    ) ;

-- For completeness: the rows you want to keep.
SELECT * FROM issue_attributes must_stay
WHERE NOT EXISTS (
    SELECT * FROM issue_attributes xx
    WHERE xx.issue_id = must_stay.issue_id
    AND xx.attribute_name = must_stay.attribute_name
    AND xx.attribute_value = must_stay.attribute_value
    AND xx.attr_timestamp > must_stay.attr_timestamp
    AND NOT EXISTS ( SELECT * FROM issue_attributes nx
        WHERE nx.issue_id = must_stay.issue_id
        AND nx.attribute_name = must_stay.attribute_name
        AND nx.attribute_value <> must_stay.attribute_value
        AND nx.attr_timestamp > must_stay.attr_timestamp
        AND nx.attr_timestamp < xx.attr_timestamp
        )
    ) ;

The result:

 issue_id |   attr_timestamp    | attribute_name |   attribute_value    
----------+---------------------+----------------+----------------------
        1 | 2011-03-01 12:00:00 | OWNER          | [email protected]
        1 | 2011-01-01 12:00:00 | OWNER          | [email protected]
        1 | 2011-02-01 12:00:00 | OWNER          | [email protected]
        2 | 2011-03-01 12:00:00 | OWNER          | [email protected]
(4 rows)

 issue_id |   attr_timestamp    | attribute_name |    attribute_value     
----------+---------------------+----------------+------------------------
        1 | 2011-04-01 12:00:00 | OWNER          | [email protected]
        2 | 2011-02-01 12:00:00 | OWNER          | [email protected]
        2 | 2011-04-01 12:00:00 | OWNER          | [email protected]
        2 | 2011-01-01 12:00:00 | OWNER          | [email protected]
(4 rows)

Upvotes: 0

Mark J. Bobak

Reputation: 14403

Here's a nice "Oracle" way to do it.

Using your sample data:

SQL> desc issue_attributes
 Name                                                              Null?    Type
 ----------------------------------------------------------------- -------- --------------------------------------------
 ISSUE_ID                                                                   NUMBER
 ATTR_TIMESTAMP                                                             TIMESTAMP(6)
 ATTRIBUTE_NAME                                                             VARCHAR2(500)
 ATTRIBUTE_VALUE                                                            VARCHAR2(500)

SQL> select * from issue_attributes;

  ISSUE_ID ATTR_TIMESTAMP                      ATTRIBUTE_ ATTRIBUTE_VALUE
---------- ----------------------------------- ---------- ------------------------------
         1 01-JAN-20 11.12.00.000000 AM        OWNER      [email protected]
         1 01-FEB-20 11.12.00.000000 AM        OWNER      [email protected]
         1 01-MAR-20 11.12.00.000000 AM        OWNER      [email protected]
         1 01-APR-20 11.12.00.000000 AM        OWNER      [email protected]
         1 01-JAN-20 11.12.00.000000 AM        OWNER      [email protected]
         1 01-JAN-20 11.12.00.000000 AM        OWNER      [email protected]
         1 01-FEB-20 11.12.00.000000 AM        OWNER      [email protected]
         1 01-MAR-20 11.12.00.000000 AM        OWNER      [email protected]
         1 01-APR-20 11.12.00.000000 AM        OWNER      [email protected]
         1 01-JAN-20 11.12.00.000000 AM        OWNER      [email protected]
         1 01-FEB-20 11.12.00.000000 AM        OWNER      [email protected]
         1 01-MAR-20 11.12.00.000000 AM        OWNER      [email protected]

12 rows selected.

SQL> delete from issue_attributes
        where rowid in(select rid
                         from (select rowid rid,
                                      row_number() over (partition by ISSUE_ID,
                                                                      ATTR_TIMESTAMP,
                                                                      ATTRIBUTE_NAME,
                                                                      ATTRIBUTE_VALUE
                                                             order by rowid) rn
                                from issue_attributes)
                        where rn<> 1);
7 rows deleted.

SQL> select * from issue_attributes;

  ISSUE_ID ATTR_TIMESTAMP              ATTRIBUTE_ ATTRIBUTE_VALUE
---------- ----------------------------------- ---------- ------------------------------
         1 01-JAN-20 11.12.00.000000 AM        OWNER      [email protected]
         1 01-FEB-20 11.12.00.000000 AM        OWNER      [email protected]
         1 01-MAR-20 11.12.00.000000 AM        OWNER      [email protected]
         1 01-APR-20 11.12.00.000000 AM        OWNER      [email protected]
         1 01-FEB-20 11.12.00.000000 AM        OWNER      [email protected]

5 rows selected.

Hope that helps.

Upvotes: 1

user128536

Reputation: 541

I don't know about Oracle in particular, but something like

SELECT MAX(attr_timestamp), issue_id, attribute_name, attribute_value
FROM issue_attributes
GROUP BY issue_id, attribute_name, attribute_value

would produce a list of where every distinct triple issue_id, attribute_name, attribute_value appears, along with the most recent timestamp, in a number of DBMS. Might be worth a try.

Upvotes: 0

How to find repeated, timestamped data

Answers (4)

Related Questions