Reputation: 11
I want to include records that only have any value in a specific location within a delimited string. For example, in the strings below, I want to only include the one in which there is data in column5. So, only Example B qualifies:
Example A: PV1|column1data||column3data|column4data||||column8data
Example B: PV1|column1data||column3data||column5data|||column8data
Upvotes: 1
Views: 469
Reputation: 3180
The LIKE
operator is slow, and will be something which will be visibly slow once you hit a few hundred thousand records. This answer isn't super quick, but will do the job and doesn't require SQL Server CLR. If your chosen system is Oracle, you will need 11g to use the same syntax.
Presuming you have a table...
create table udata (
ID int primary key identity(1,1)
, string varchar(2000) not null
);
You'll want a variable for this for any future changes, but also for script flexibility.
declare @delimiter varchar(1) = '|'
Then set everything up in a CTE.
;with parser as (
select
ID
, 0 as colNum
, substring(d.string, endPos + (2 * delimLen), len(d.string)) as string
, startPos
, endPos
from udata d
cross apply (
select
len(@delimiter) as startPos
, case charindex(@delimiter,d.string) when 0 then len(d.string) + len(@delimiter) else charindex(@delimiter,d.string) end - len(@delimiter) as endPos
, len(@delimiter) as delimLen
) p
where id between 2000 and 10000
union all
select
ID
, colNum + 1 as colNum
, substring(d.string, p.endPos + (2 * delimLen), len(d.string)) as string
, d.endPos + (2 * delimLen) as startPos
, d.endPos + (delimLen) + p.endPos as endPos
from parser d
cross apply (
select
len(@delimiter) as startPos
, case charindex(@delimiter,d.string) when 0 then len(d.string) + len(@delimiter) else charindex(@delimiter,d.string) end - len(@delimiter) as endPos
, len(@delimiter) as delimLen
) p
where string != ''
), selector as (
select u.id, p.colNum, substring(u.string, p.startPos, p.endPos - p.startPos + len(@delimiter)) as colVal--,u.string, p.startPos, p.endPos
from udata u
inner join parser p
on p.ID = u.ID
)
What this does is first mark the locations for the begin and end of each column value, then the selector slices it out of the source string. Note the where clause in the first part of the recursive query: where id between 2000 and 10000
You will want to limit your records here. This might be where you limit it to the types of records you are looking for.
Finally, select out your columns in a pivot for ease of reading:
select *
from selector
pivot (
max(colVal) for colNum in ([1],[2],[3],[4],[5],[6],[7],[8])
) pv
The original rows can instead be returned using your original criteria like this:
select *
from udata u
where exists (
select top 1 1
from selector s
where s.colNum = 5
and s.colVal =''
and s.ID = u.ID
)
My test data contains ~160K rows, and the exists query, without the ID limits in the CTE, took 18 seconds to run on laptop hardware. Still, this should do the trick.
Upvotes: 0
Reputation: 15251
If your data always has the same format of 9 items and 8 separators:
select * from tab where col not like '%|%|%|%|%||%|%|%'
That said, there may be better options available in your RDBMS if you'd let us know which you're using. Also, storing multiple items in one column is one of the worst anti-patterns in a relational database.
Update for jagged data
For an unknown number of elements in your data, this could work in SQL server (depending upon whitespace and ANSI settings):
select *
from tab
where col not like replicate('%|', 5) + replicate('|%', len(col) - len(replace(col, '|', '')) - 5)
Here we're calculating the number of elements for each value and dynamically creating the like
pattern to work as it does above. Other databases should have similar functions allowing much the same logic. I'm not sure how much better the answers can get without having more info.
Upvotes: 1