SQL SERVER - Finding Duplicates on XML Values

Question

I have a Movie table with two columns: ID (int) and Metadata (xml). The Metadata looks like this:


  
    aaa
    bbb
    bbb
    ddd
    ddd
    adadad
    ch
    cm
    sss
    ggg
    eee
    ccc
    ddd

I want to find records where there are duplicates within a credit type. Here, actor "bbb" is a duplicate (but "ddd" is not).

The query below also returns records where the same person is both an actor and a director, and I don't want those to show up.

-- Check for Duplicate Cast and Crew
WITH XMLNAMESPACES (DEFAULT 'urn:schemas-xxx:yyy:catalog')
SELECT Count(*)
FROM Movie
WHERE Metadata.value('count(/movie/credits/credit)', 'int') <> Metadata.value('count(distinct-values(/movie/credits/credit))', 'int')

If I modify my query like below, it works.

WITH XMLNAMESPACES (DEFAULT 'urn:schemas-xxx:yyy:catalog')
SELECT Count(*)
FROM Movie
WHERE
(
    (Metadata.value('count(/movie/credits/credit[@creditType="Actor"])', 'int') <>
        Metadata.value('count(distinct-values(/movie/credits/credit[@creditType="Actor"]))', 'int')
    )
    OR (Metadata.value('count(/movie/credits/credit[@creditType="Director"])', 'int') <>
        Metadata.value('count(distinct-values(/movie/credits/credit[@creditType="Director"]))', 'int')
    )
    OR (Metadata.value('count(/movie/credits/credit[@creditType="Producer"])', 'int') <>
        Metadata.value('count(distinct-values(/movie/credits/credit[@creditType="Producer"]))', 'int')
    )
)

But there are a lot of credit types, such as composer, editor, etc., and I don't want to repeat this condition for every credit type. Is there a more efficient way to do this?

Update:

I found that the previous query compares values case-sensitively. I needed a case-insensitive comparison, so I changed it to the one below:

WITH XMLNAMESPACES (DEFAULT 'urn:xxx:yyy:catalog')
SELECT Count(*) FROM
(
SELECT  ID
FROM Movie
CROSS APPLY
Movie.Metadata.nodes('/movie/credits/credit[@creditType="Actor"]') x(y)
GROUP BY ID
HAVING 
 COUNT(y.value('.', 'varchar(100)')) <> COUNT(Distinct y.value('.', 'varchar(100)'))
) AS temp;

But my original problem still remains.

Mikael Eriksson · Accepted Answer

You can use a FLWOR expression and compare the counts for each distinct value of @creditType. Return a dummy node and check for existence of nodes using exist().

with xmlnamespaces(default 'urn:schemas-xxx:yyy:catalog')
select count(*)
from Movie as M
where M.Metadata.exist('
  for $creditType in distinct-values(/movie/credits/credit/@creditType)
  where count(distinct-values(/movie/credits/credit[@creditType = $creditType]/text())) != count(/movie/credits/credit[@creditType = $creditType]/text())
  return <X/>') = 1;
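
To try the query without access to the original data, a small test setup can be sketched as follows. The table variable @Movie and its two rows are made-up sample data, and the element layout is only assumed from the XPath expressions in the question (movie/credits/credit with a creditType attribute); the real documents probably contain more elements. Movie 1 repeats actor "bbb", while movie 2 has the same name under two different credit types.

```sql
-- Hypothetical sample data: the XML layout is inferred from the XPath in the question.
DECLARE @Movie TABLE (ID int PRIMARY KEY, Metadata xml);

INSERT INTO @Movie (ID, Metadata)
VALUES
(1, N'<movie xmlns="urn:schemas-xxx:yyy:catalog"><credits>
        <credit creditType="Actor">bbb</credit>
        <credit creditType="Actor">bbb</credit>
        <credit creditType="Actor">ddd</credit>
        <credit creditType="Director">ddd</credit>
      </credits></movie>'),
(2, N'<movie xmlns="urn:schemas-xxx:yyy:catalog"><credits>
        <credit creditType="Actor">aaa</credit>
        <credit creditType="Director">aaa</credit>
      </credits></movie>');

-- Same FLWOR check as above, listing the matching IDs instead of counting them.
WITH XMLNAMESPACES (DEFAULT 'urn:schemas-xxx:yyy:catalog')
SELECT M.ID
FROM @Movie AS M
WHERE M.Metadata.exist('
  for $creditType in distinct-values(/movie/credits/credit/@creditType)
  where count(distinct-values(/movie/credits/credit[@creditType = $creditType]/text()))
        != count(/movie/credits/credit[@creditType = $creditType]/text())
  return <X/>') = 1;
```

Selecting the ID rather than count(*) makes it easier to see which documents were flagged while testing.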

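
For the case-insensitive requirement from the update, one possible variant shreds every credit with nodes() and groups by movie, credit type, and name. This is only a sketch and not part of the accepted answer: the aliases (S, D, CreditType, CreditName) and the collation Latin1_General_CI_AS are assumptions chosen for illustration. The values are extracted in a derived table first because SQL Server does not allow XML methods directly in a GROUP BY clause.

```sql
WITH XMLNAMESPACES (DEFAULT 'urn:schemas-xxx:yyy:catalog')
SELECT COUNT(DISTINCT D.ID) AS MoviesWithDuplicates
FROM (
    SELECT S.ID
    FROM (
        -- One row per credit; the explicit collation (an assumption) makes
        -- 'bbb' and 'BBB' compare as equal when grouping.
        SELECT M.ID,
               C.N.value('@creditType', 'varchar(100)') AS CreditType,
               C.N.value('.', 'varchar(100)') COLLATE Latin1_General_CI_AS AS CreditName
        FROM Movie AS M
        CROSS APPLY M.Metadata.nodes('/movie/credits/credit') AS C(N)
    ) AS S
    -- A group with more than one row is a name repeated within one credit type.
    GROUP BY S.ID, S.CreditType, S.CreditName
    HAVING COUNT(*) > 1
) AS D;
```

Because the grouping includes the credit type, a person credited once as actor and once as director does not count as a duplicate.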
