Sudharsanan
Sudharsanan

Reputation: 587

SQL SERVER - Finding Duplicates on XML Values

I have a movie table with 2 columns. ID(int) and MetaData(XML). The MetaData looks like this:

<movie xmlns="urn:schemas-xxx:yyy:catalog" >
  <credits>
    <credit creditId="15594954" creditType="Actor" >aaa</credit>
    <credit creditId="15573106" creditType="Actor" >bbb</credit>
    <credit creditId="15781056" creditType="Actor" >bbb</credit>
    <credit creditId="15781056" creditType="Actor" >ddd</credit>
    <credit creditId="15606109" creditType="Director" >ddd</credit>
    <credit creditId="16316911" creditType="Art Director" >adadad</credit>
    <credit creditId="18484117" creditType="Choreographer" >ch</credit>
    <credit creditId="15707268" creditType="Cinematographer" >cm</credit>
    <credit creditId="15907445" creditType="Screenwriter">sss</credit>
    <credit creditId="15905546" creditType="Screenwriter" >ggg</credit>
    <credit creditId="16493602" creditType="Editor" >eee</credit>
    <credit creditId="15825749" creditType="Composer" >ccc</credit>
    <credit creditId="18486706" creditType="Composer" >ddd</credit>
  </credits>
</movie>

I want to find records where there are duplicates within a credit type - Here actor "bbb" is a duplicate (but "ddd" is not).

If I have a query like below, it even throws records where the actor is also the director. But I don’t want them to show up.

-- Check for Duplicate Cast and Crew
WITH XMLNAMESPACES (DEFAULT 'urn:schemas-xxx:yyy:catalog')
SELECT Count(*)
FROM Movie
WHERE Metadata.value('count(/movie/credits/credit)', 'int') <> Metadata.value('count(distinct-values(/movie/credits/credit))', 'int')

 If I modify my query like below, it works.

WITH XMLNAMESPACES (DEFAULT 'urn:schemas-xxx:yyy:catalog')
SELECT Count(*)
FROM Movie
WHERE 
 (
    (Metadata.value('count(/movie/credits/credit[@creditType="Actor"])', 'int') <> 
        Metadata.value('count(distinct-values(/movie/credits/credit[@creditType="Actor"]))', 'int')
        )

    OR (Metadata.value('count(/movie/credits/credit[@creditType="Director"])', 'int') <> 
        Metadata.value('count(distinct-values(/movie/credits/credit[@creditType="Director"]))', 'int')
        )
     OR (Metadata.value('count(/movie/credits/credit[@creditType="Producer"])', 'int') <> 
        Metadata.value('count(distinctvalues(/movie/credits/credit[@creditType="Producer"]))', 'int')
        )
)

But there are a lot of credit types like composer, editor etc and I don’t want to do it this way for every credit type. Is there any efficient way to do this?

Update:

I found that the previous query did case-sensitive search. I needed a case-insensitive one, so changed it like the one below:

WITH XMLNAMESPACES (DEFAULT 'urn:xxx:yyy:catalog')
SELECT Count(*) FROM
(
SELECT  ID
FROM Movie
CROSS APPLY
Movie.Metadata.nodes('/movie/credits/credit[@creditType="Actor"]') x(y)
GROUP BY ID
HAVING 
 COUNT(y.value('.', 'varchar(100)')) <> COUNT(Distinct y.value('.', 'varchar(100)'))
) AS temp;

But My original problem still remains.

Upvotes: 2

Views: 1076

Answers (1)

Mikael Eriksson
Mikael Eriksson

Reputation: 138960

You can use a FLOWR and check the count for each distinct value of @creditType. Return a dummy node and check for existence of nodes using exist().

with xmlnamespaces(default 'urn:schemas-xxx:yyy:catalog')
select count(*)
from Movie as M
where M.Metadata.exist('
  for $creditType in distinct-values(/movie/credits/credit/@creditType)
  where count(distinct-values(/movie/credits/credit[@creditType = $creditType]/text())) != count(/movie/credits/credit[@creditType = $creditType]/text())
  return <X/>') = 1;

SQL-Fiddle

Upvotes: 1

Related Questions