Jason Francis

Reputation: 1104

Counting consecutive duplicate records with SQL

I have a data-analysis question that I could easily solve with some T-SQL or some scripting, but I was wondering if there is a clever SQL solution. The problem is that it messes a bit with SQL's row-independence assumption.

I have a table that consists of name-value pairs associated with a user and ordered by submission, for example:

ID      USERID  VARIABLE    VALUE   SUBMITTED
3115    2287    votech05    2       2009-02-02 15:34:00
3116    2287    comcol05    1       2009-02-02 15:34:00
3117    2287    fouryr05    1       2009-02-02 15:35:00
3118    2287    none05      2       2009-02-02 15:35:00
3119    2287    ocol1_05    2       2009-02-02 15:44:00
3120    2287    disnone     2       2009-02-02 15:45:00
3121    2287    dissense    2       2009-02-02 15:49:00
3122    2287    dismobil    3       2009-02-02 15:51:00
3123    2287    dislearn    3       2009-02-02 15:51:00
3124    2287    disment     3       2009-02-02 15:52:00
3125    2287    disother    2       2009-02-02 15:55:00
3126    2287    disrefus    7       2009-02-02 15:58:00

I'd like to determine the value and count of the largest group of consecutive identical values (when the data is ordered by the ID primary key). So, for the above example, because I have four value=2 rows appearing in sequence, and only three value=3 rows, I would want to report:

USERID     VALUE      COUNT
2287       2          4

for the given user.

Again, this could be done fairly quickly using other tools, but since the data set is quite large (about 75 million records) and changes frequently, it would be nice to be able to solve this problem with a query. I'm working with SQL Server 2005.

Upvotes: 3

Views: 4392

Answers (3)

eschlech

Without testing it, I think that the following should work:

Row_number() over (partition by userid, value order by id)

Once this is done, just select the row with the highest row_number.

Please let me know if this worked!!

Thanks, Edi
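
One caveat about the expression above: partitioning by userid and value numbers every occurrence of a value, not only consecutive ones, so for the sample data the highest row number for value 2 would be 6 rather than 4. A possible variation, not part of the answer as posted, is to subtract that numbering from a plain per-user numbering; rows in the same consecutive run then share the same difference. The sketch below assumes the YourTable(id, userid, value) test table from the other answer in this thread, and the names Runs, RunLengths, Ranked, grp, runlength and rnk are made up for illustration.

```sql
-- Sketch only: assumes a table YourTable(id, userid, value); CTE and column names are illustrative.
WITH Runs as (
    select userid, value,
        -- Rows in the same consecutive run share the same difference
        -- between the per-user numbering and the per-user-and-value numbering.
        grp = row_number() over (partition by userid order by id)
            - row_number() over (partition by userid, value order by id)
    from YourTable
),
RunLengths as (
    -- One row per run, with the number of rows in that run
    select userid, value, runlength = count(*)
    from Runs
    group by userid, value, grp
),
Ranked as (
    -- Rank each user's runs, longest first (ties are broken arbitrarily)
    select userid, value, runlength,
        rnk = row_number() over (partition by userid order by runlength desc)
    from RunLengths
)
select userid, value, [count] = runlength
from Ranked
where rnk = 1
```

For the sample rows in the question this should report userid 2287, value 2, count 4. It only uses row_number() and CTEs, both of which are available in SQL Server 2005.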

Upvotes: 0

Andomar

Reputation: 238058

(Edited after comment)

You can do that by assigning a "head" number to each group of consecutive values. After that you select the head number for each row, and do an aggregate per head.

Here's an example, with CTEs for readability:

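WITH
-- Number the user's rows consecutively in id order
OrderedTable as (
    select value, rownr = row_number() over (order by userid, id)
    from YourTable
    where userid = 2287
),
-- Keep only the first row of each run (its value differs from the previous row's)
-- and number those runs; the -1 sentinel assumes -1 is never an actual value
Heads as (
    select cur.rownr, CurValue = cur.value
    ,   headnr = row_number() over (order by cur.rownr)
    from OrderedTable cur
    left join OrderedTable prev on cur.rownr = prev.rownr+1 
    where IsNull(prev.value,-1) != cur.value
),
-- Tag every row with the number of the run it belongs to
ValuesWithHead as (
    select value
    ,   HeadNr = (select max(headnr) 
                from Heads 
                where Heads.rownr <= data.rownr)
    from OrderedTable data
)
-- Count the rows in each run, longest run first
select Value, [Count] = count(*)
from ValuesWithHead
group by HeadNr, value
order by count(*) desc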

This will output:

Value   Count
2       4
3       3
1       2
2       1
2       1
7       1

Use "top 1" to select the first row only.

Here's my query to create the test data:

create table YourTable (
    id int primary key,
    userid int,
    variable varchar(25),
    value int
)
insert into YourTable (id, userid, variable, value) values (3115, 2287, 'votech05', 2)
insert into YourTable (id, userid, variable, value) values (3116, 2287, 'comcol05', 1)
insert into YourTable (id, userid, variable, value) values (3117, 2287, 'fouryr05', 1)
insert into YourTable (id, userid, variable, value) values (3118, 2287, 'none05', 2)
insert into YourTable (id, userid, variable, value) values (3119, 2287, 'ocol1_05', 2)
insert into YourTable (id, userid, variable, value) values (3120, 2287, 'disnone', 2)
insert into YourTable (id, userid, variable, value) values (3121, 2287, 'dissense', 2)
insert into YourTable (id, userid, variable, value) values (3122, 2287, 'dismobil', 3)
insert into YourTable (id, userid, variable, value) values (3123, 2287, 'dislearn', 3)
insert into YourTable (id, userid, variable, value) values (3124, 2287, 'disment', 3)
insert into YourTable (id, userid, variable, value) values (3125, 2287, 'disother', 2)
insert into YourTable (id, userid, variable, value) values (3126, 2287, 'disrefus', 7)

Upvotes: 3

Steve Kass

Reputation: 7184

This may be one of those problems best solved with cursors. Give this a try. It should be close, but it's not tested, since you didn't provide CREATE TABLE and INSERT statements with sample data to make that easy.

declare @userid int
set @userid = 2287;
declare C cursor fast_forward for
select VALUE from T
where USERID = @userid
order by ID;

declare @value int, @prevvalue int;
declare @runcount int, @runlongest int;
set @runlongest = 0;
declare @valuelongest int;
open C;
fetch next from C into @value;
while @@fetch_status = 0 begin
  if @value = @prevvalue set @runcount = @runcount + 1 else set @runcount = 1;
  if @runcount > @runlongest begin
    set @runlongest = @runcount;
    set @valuelongest = @value;
  end;
  set @prevvalue = @value;
  fetch next from C into @value;
end;
select @userid as USERID, @valuelongest as VALUE, @runlongest as [COUNT];

close C;
deallocate C;

It won't be fast with 75M rows, but it probably won't be too slow, either. If your runs are very long, and you have the right indexes, you can do better by numbering the rows with row_number in a temp table, then using a WHILE loop that jumps through a run at a time. Let me know if you think that's worth looking at (and if you can, post CREATE TABLE and INSERT statements with sample data).
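
The temp-table idea mentioned above could look roughly like the following. This is an untested sketch under a few assumptions: the table is T(ID, USERID, VALUE) as in the cursor example, VALUE is never NULL, and the names #Numbered, IX_Numbered, rn, @pos, @next and @last are invented for illustration.

```sql
-- Sketch only: assumes T(ID, USERID, VALUE) with no NULLs in VALUE.
declare @userid int;
set @userid = 2287;

-- Number the user's rows 1..n in ID order
select rn = row_number() over (order by ID), VALUE
into #Numbered
from T
where USERID = @userid;

create unique clustered index IX_Numbered on #Numbered (rn);

declare @pos int, @next int, @last int, @value int;
declare @runlongest int, @valuelongest int;
set @runlongest = 0;
set @pos = 1;
select @last = max(rn) from #Numbered;

-- Each iteration handles one whole run instead of one row
while @pos <= @last begin
  select @value = VALUE from #Numbered where rn = @pos;
  -- First later row with a different value; NULL when the run reaches the end
  select @next = min(rn) from #Numbered where rn > @pos and VALUE <> @value;
  if @next is null set @next = @last + 1;
  if @next - @pos > @runlongest begin
    set @runlongest = @next - @pos;
    set @valuelongest = @value;
  end;
  set @pos = @next;
end;

select @userid as USERID, @valuelongest as VALUE, @runlongest as [COUNT];

drop table #Numbered;
```

The loop runs once per run rather than once per row, so it should only pay off when runs are long; with many short runs the cursor is likely just as good.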

Upvotes: 2
