mipadi
mipadi

Reputation: 410632

How can I use SQL to select duplicate records, along with counts of related items?

I know the title of this question is a bit confusing, so bear with me. :)

I have a (MySQL) database with a Person record. A Person also has a slug field. Unfortunately, slug fields are not unique. There are a number of duplicate records, i.e., the records have different IDs but the same first name, last name, and slug. A Person may also have 0 or more associated articles, blog entries, and podcast episodes.

If that's confusing, here's a diagram of the structure:

alt text
(source: cbstaff.com)

I would like to produce a list of records that match this criteria: duplicate records (i.e., same slug field) for people who also have at least 1 article, blog entry, or podcast episode.

I have a SQL query that will list all records with the same slug fields:

SELECT
 id,
 first_name,
 last_name,
 slug,
 COUNT(slug) AS person_records
FROM
 people_person
GROUP BY
 slug
HAVING
 (COUNT(slug) > 1)
ORDER BY
 last_name, first_name, id;

But this includes records for people that may not have at least 1 article, blog entry, or podcast. Can I tweak this to fit the second criteria?

Edit:

I updated the database diagram to simplify it and make it more clear what I am doing. (Note, some of the DB table names changed -- I was trying to give a higher-level look at the structure before, but it was a bit unclear.)

Upvotes: 4

Views: 644

Answers (5)

Thomas
Thomas

Reputation: 64635

Select P.id, P.first_name, P.last_name, P.slug
From people_person as P
    Join    (
            Select P1.slug
            From people_person As P1
            Where Exists    (
                            Select 1
                            From magazine_author As ma1
                            Where ma1.person_id = P1.id
                            Union All
                            Select 1
                            From podcast_episode_guests As pod1
                            Where pod1.person_id = P1.Id
                            Union All
                            Select 1
                            From blogs_blog_authors As b1
                            Where b1.person_id = P1.Id
                            )
            Group By P1.slug
            Having Count(*) > 1
            ) As dup_slugs
        On dup_slugs.slug = P.slug
Order By P.last_name, P.first_name, P.id

Upvotes: 2

Andrew Russell
Andrew Russell

Reputation: 485

The other sql statements in the question and other answers are all incorrect, I will try to explain how to avoid the chicken and egg problem using a function (which makes the code much clearer):

SELECT  first_name,
        last_name, 
        slug,
        COUNT(slug) AS person_records,
        SUM(get_count_articles(id)) AS total_articles
FROM  people_person
GROUP BY first_name,
        last_name, 
        slug
HAVING  COUNT(*) > 1 AND SUM(get_count_articles(id))>=1
ORDER   BY  last_name, first_name;

With the function (written in Oracle syntax, please excuse my lack of knowledge of mysql functions).

FUNCTION get_count_articles(p_id NUMBER) RETURNS NUMBER IS
  l_mag_auth NUMBER;
  l_pod_guests NUMBER;
  l_blog_auth NUMBER;
BEGIN
  SELECT COUNT(*)
  INTO l_mag_auth
  FROM magazine_author ma1, article a1
  WHERE ma1.person_id = p_id;

  SELECT COUNT(*) 
  INTO l_pod_guests
  FROM podcast_episode_guests As pod1
  WHERE pod1.person_id = p_id;

  SELECT COUNT(*)
  INTO l_blog_auth
  FROM blogs_blog_authors As b1
  WHERE b1.person_id = p_id;

  RETURN l_mag_auth+l_pod_guests+l_blog_auth;
END;

Note1: The magazine_author should be linked to article as above because the there may not actually be an article.

Note2: I have removed the ID from the original questions select and group by because it will force the wrong answer (as id should be unique in the table no record will be returned EVER). The syntax count(slug) may be confusing the issue here. If the output requires both of the duplicate rows then you MUST re-link to the people_person table to show the list of id's for the slug.

Upvotes: 0

Gabriel Solomon
Gabriel Solomon

Reputation: 30035

SELECT
 id,
 first_name,
 last_name,
 slug,
 COUNT(slug) AS person_records,
FROM
 people_person
WHERE 
 id IN (SELECT person_id from podcast_guests GROUP BY person_id) OR 
 id IN (SELECT person_id from authors GROUP BY person_id) OR 
 [....]
GROUP BY
 slug
HAVING
 (COUNT(slug) > 1)
ORDER BY
 last_name, first_name, id;

Upvotes: 1

Will Marcouiller
Will Marcouiller

Reputation: 24132

You could perhaps handle it through the having clause:

select Id
        , last_name
        , first_name
        , slug
        , COUNT(*) as Person_Records
    from Person as p
    group by Id
            , last_name
            , first_name
            , slug
        having COUNT(slug) > 1
            and ( 
                select COUNT(*)
                    from Author as a
                    where a.Person_Id = p.Id
            ) > 1
            and (
                select COUNT(*)
                    from Podcast_Guests as pg
                    where pg.Person_Id = p.Id
            ) > 1

I omitted the remaining conditions as this is a simple sample.

I hope this helps! =)

Upvotes: 1

Marcus Adams
Marcus Adams

Reputation: 53830

You can still include a WHERE clause to filter the results:

SELECT
 id,
 first_name,
 last_name,
 slug,
 COUNT(slug) AS person_records
FROM
 people_person
WHERE id IN (SELECT id FROM article)
GROUP BY
 slug
HAVING
 (COUNT(slug) > 1)
ORDER BY
 last_name, first_name, id;

Upvotes: 1

Related Questions