Leslie Harrison
Leslie Harrison

Reputation: 11

SQL: Filter on Any Data in Delimited Column

I want to include records that only have any value in a specific location within a delimited string. For example, in the strings below, I want to only include the one in which there is data in column5. So, only Example B qualifies:

Example A: PV1|column1data||column3data|column4data||||column8data

Example B: PV1|column1data||column3data||column5data|||column8data

Upvotes: 1

Views: 469

Answers (2)

Jaaz Cole
Jaaz Cole

Reputation: 3180

The LIKE operator is slow, and will be something which will be visibly slow once you hit a few hundred thousand records. This answer isn't super quick, but will do the job and doesn't require SQL Server CLR. If your chosen system is Oracle, you will need 11g to use the same syntax.

Presuming you have a table...

create table udata (
    ID int primary key identity(1,1)
  , string varchar(2000) not null
);

You'll want a variable for this for any future changes, but also for script flexibility.

declare @delimiter varchar(1) = '|'

Then set everything up in a CTE.

;with parser as (
    select 
          ID
        , 0 as colNum
        , substring(d.string, endPos + (2 * delimLen), len(d.string)) as string
        , startPos
        , endPos
    from udata d
        cross apply (
            select
                  len(@delimiter) as startPos
                , case charindex(@delimiter,d.string) when 0 then len(d.string) + len(@delimiter) else charindex(@delimiter,d.string) end  - len(@delimiter) as endPos
                , len(@delimiter) as delimLen
        ) p
    where id between 2000 and 10000
    union all
    select 
          ID
        , colNum + 1 as colNum
        , substring(d.string, p.endPos + (2 * delimLen), len(d.string)) as string
        , d.endPos + (2 * delimLen) as startPos
        , d.endPos + (delimLen) + p.endPos as endPos
    from parser d
        cross apply (
            select
                  len(@delimiter) as startPos
                , case charindex(@delimiter,d.string) when 0 then len(d.string) + len(@delimiter) else charindex(@delimiter,d.string) end  - len(@delimiter) as endPos
                , len(@delimiter) as delimLen
        ) p
    where string != ''
), selector as (
    select u.id, p.colNum, substring(u.string, p.startPos, p.endPos - p.startPos + len(@delimiter)) as colVal--,u.string, p.startPos, p.endPos
    from udata u
        inner join parser p
            on p.ID = u.ID
)

What this does is first mark the locations for the begin and end of each column value, then the selector slices it out of the source string. Note the where clause in the first part of the recursive query: where id between 2000 and 10000 You will want to limit your records here. This might be where you limit it to the types of records you are looking for.

Finally, select out your columns in a pivot for ease of reading:

select *
from selector
    pivot (
        max(colVal) for colNum in ([1],[2],[3],[4],[5],[6],[7],[8])
    ) pv

The original rows can instead be returned using your original criteria like this:

select *
from udata u
where exists (
        select top 1 1
        from selector s
        where s.colNum = 5
            and s.colVal =''
            and s.ID = u.ID
    )

My test data contains ~160K rows, and the exists query, without the ID limits in the CTE, took 18 seconds to run on laptop hardware. Still, this should do the trick.

Upvotes: 0

Tim Lehner
Tim Lehner

Reputation: 15251

If your data always has the same format of 9 items and 8 separators:

select * from tab where col not like '%|%|%|%|%||%|%|%'

That said, there may be better options available in your RDBMS if you'd let us know which you're using. Also, storing multiple items in one column is one of the worst anti-patterns in a relational database.

Update for jagged data

For an unknown number of elements in your data, this could work in SQL server (depending upon whitespace and ANSI settings):

select *
from tab
where col not like replicate('%|', 5) + replicate('|%', len(col) - len(replace(col, '|', '')) - 5)

Here we're calculating the number of elements for each value and dynamically creating the like pattern to work as it does above. Other databases should have similar functions allowing much the same logic. I'm not sure how much better the answers can get without having more info.

Upvotes: 1

Related Questions