Reputation: 587
I have a movie table with 2 columns. ID(int) and MetaData(XML). The MetaData looks like this:
<movie xmlns="urn:schemas-xxx:yyy:catalog" >
<credits>
<credit creditId="15594954" creditType="Actor" >aaa</credit>
<credit creditId="15573106" creditType="Actor" >bbb</credit>
<credit creditId="15781056" creditType="Actor" >bbb</credit>
<credit creditId="15781056" creditType="Actor" >ddd</credit>
<credit creditId="15606109" creditType="Director" >ddd</credit>
<credit creditId="16316911" creditType="Art Director" >adadad</credit>
<credit creditId="18484117" creditType="Choreographer" >ch</credit>
<credit creditId="15707268" creditType="Cinematographer" >cm</credit>
<credit creditId="15907445" creditType="Screenwriter">sss</credit>
<credit creditId="15905546" creditType="Screenwriter" >ggg</credit>
<credit creditId="16493602" creditType="Editor" >eee</credit>
<credit creditId="15825749" creditType="Composer" >ccc</credit>
<credit creditId="18486706" creditType="Composer" >ddd</credit>
</credits>
</movie>
I want to find records where there are duplicates within a credit type - Here actor "bbb" is a duplicate (but "ddd" is not).
If I have a query like below, it even throws records where the actor is also the director. But I don’t want them to show up.
-- Check for Duplicate Cast and Crew
WITH XMLNAMESPACES (DEFAULT 'urn:schemas-xxx:yyy:catalog')
SELECT Count(*)
FROM Movie
WHERE Metadata.value('count(/movie/credits/credit)', 'int') <> Metadata.value('count(distinct-values(/movie/credits/credit))', 'int')
If I modify my query like below, it works.
WITH XMLNAMESPACES (DEFAULT 'urn:schemas-xxx:yyy:catalog')
SELECT Count(*)
FROM Movie
WHERE
(
(Metadata.value('count(/movie/credits/credit[@creditType="Actor"])', 'int') <>
Metadata.value('count(distinct-values(/movie/credits/credit[@creditType="Actor"]))', 'int')
)
OR (Metadata.value('count(/movie/credits/credit[@creditType="Director"])', 'int') <>
Metadata.value('count(distinct-values(/movie/credits/credit[@creditType="Director"]))', 'int')
)
OR (Metadata.value('count(/movie/credits/credit[@creditType="Producer"])', 'int') <>
Metadata.value('count(distinctvalues(/movie/credits/credit[@creditType="Producer"]))', 'int')
)
)
But there are a lot of credit types like composer, editor etc and I don’t want to do it this way for every credit type. Is there any efficient way to do this?
Update:
I found that the previous query did case-sensitive search. I needed a case-insensitive one, so changed it like the one below:
WITH XMLNAMESPACES (DEFAULT 'urn:xxx:yyy:catalog')
SELECT Count(*) FROM
(
SELECT ID
FROM Movie
CROSS APPLY
Movie.Metadata.nodes('/movie/credits/credit[@creditType="Actor"]') x(y)
GROUP BY ID
HAVING
COUNT(y.value('.', 'varchar(100)')) <> COUNT(Distinct y.value('.', 'varchar(100)'))
) AS temp;
But My original problem still remains.
Upvotes: 2
Views: 1076
Reputation: 138960
You can use a FLOWR and check the count for each distinct value of @creditType
. Return a dummy node and check for existence of nodes using exist()
.
with xmlnamespaces(default 'urn:schemas-xxx:yyy:catalog')
select count(*)
from Movie as M
where M.Metadata.exist('
for $creditType in distinct-values(/movie/credits/credit/@creditType)
where count(distinct-values(/movie/credits/credit[@creditType = $creditType]/text())) != count(/movie/credits/credit[@creditType = $creditType]/text())
return <X/>') = 1;
Upvotes: 1