Anseur
Anseur

Reputation: 53

SQL to show and group by occurrences of multiple column duplicates

I have a table that looks something like this but with lots of rows:

+-----------+-------------+------------+-------+-----------+-------+---------+--------+---------+
| systemid  | colcitrixid | colleague  |  time |   date    |  week |  period |  year  |  usern  |
| 32855671  |   nameu02   |  somename  | 10:12 | 2019-11-01|   23  |    5    |  2019  | othname |
|
|
|   ...     |     ...     |    ...     |  ...  |    ...    |  ...  |   ...   |  ...   |   ...   |
+-----------+-------------+------------+-------+-----------+-------+---------+--------+---------+

(column names summaried and some missed out so it fits in this stackoverflow editor width.)

I would like to run a query to get a list of duplicate rows where the systemid, the colcitrixid and the date are all the same so I can put them into another table to work on.

Here is what I have so far:

SELECT
    systemid,
    colleaguecitrixid,
    colleague,
    TIME,
    DATE,
    WEEK,
    period,
    fincyear,
    username,
    COUNT(systemid) AS mids,
    COUNT(colleaguecitrixid) AS colleagues,
    GROUP_CONCAT(DISTINCT id SEPARATOR ',') AS ids
FROM
    table_name_2019
WHERE
    period = '9' AND calltype = 'Advice' AND enabled = '1' AND isarchived = '0' AND DATE <> '$todaysdate'
GROUP BY
    systemid,
    colleague,
    DATE
HAVING
    COUNT(systemid) > 1 AND COUNT(colleaguecitrixid) > 1

It almost works, but I'm missing something, because it is showing duplicates that include different colleaguecitrixid entries. It's doing it's result based on these rows from the orignal table like so

+-----------+-------------+------------+-------+-----------+-------+---------+--------+---------+
| systemid  | colcitrixid | colleague  |  time |   date    |  week |  period |  year  |  usern  |
| 32852671  |   nameu02   |  Jack      | 10:12 | 2019-11-01|   23  |    5    |  2019  | othname |
| 32852671  |   nameu02   |  Jack      | 10:14 | 2019-11-01|   23  |    5    |  2019  | name5   |
| 32852671  |   nameu09   |  James     | 10:14 | 2019-11-01|   23  |    5    |  2019  | name5   |
| 34855673  |   nameu05   |  Bob       | 11:18 | 2019-11-03|   23  |    5    |  2019  | name2   |
| 34855673  |   nameu05   |  Bob       | 11:18 | 2019-11-03|   23  |    5    |  2019  | othname |
| 32851672  |   nameu08   |  Sarah     | 13:17 | 2019-11-01|   23  |    5    |  2019  | name2   |
| 32851672  |   nameu08   |  Sarah     | 13:19 | 2019-11-01|   23  |    5    |  2019  | name3   |
| 32851672  |   nameu08   |  Sarah     | 13:21 | 2019-11-01|   23  |    5    |  2019  | name1   |
| 32855671  |   nameu06   |  Jim       | 10:19 | 2019-11-01|   23  |    5    |  2019  | othname |
| 32855671  |   nameu06   |  Jim       | 10:22 | 2019-11-01|   23  |    5    |  2019  | othname |
|   ...     |     ...     |    ...     |  ...  |    ...    |  ...  |   ...   |  ...   |   ...   |
+-----------+-------------+------------+-------+-----------+-------+---------+--------+---------+

(Note, row 3, where we see James, a different colleague, but the system id is the same as the rows above with Jack)

but I would like it to only include the ones that are duplicated systemid AND colleaguecitrixid AND date, grouped together a bit like so:

+-----------+-------------+------------+-------+-----------+-------+---------+--------+---------+
| systemid  | colcitrixid | colleague  |  time |   date    |  week |  period |  year  |  usern  |
| 32852671  |   nameu02   |  Jack      | 10:12 | 2019-11-01|   23  |    5    |  2019  | othname |
| 32852671  |   nameu02   |  Jack      | 10:14 | 2019-11-01|   23  |    5    |  2019  | name5   |
| 34855673  |   nameu05   |  Bob       | 11:18 | 2019-11-03|   23  |    5    |  2019  | name2   |
| 34855673  |   nameu05   |  Bob       | 11:18 | 2019-11-03|   23  |    5    |  2019  | othname |
| 32851672  |   nameu08   |  Sarah     | 13:17 | 2019-11-01|   23  |    5    |  2019  | name2   |
| 32851672  |   nameu08   |  Sarah     | 13:19 | 2019-11-01|   23  |    5    |  2019  | name3   |
| 32851672  |   nameu08   |  Sarah     | 13:21 | 2019-11-01|   23  |    5    |  2019  | name1   |
| 32855671  |   nameu06   |  Jim       | 10:19 | 2019-11-01|   23  |    5    |  2019  | othname |
| 32855671  |   nameu06   |  Jim       | 10:22 | 2019-11-01|   23  |    5    |  2019  | othname |
|   ...     |     ...     |    ...     |  ...  |    ...    |  ...  |   ...   |  ...   |   ...   |
+-----------+-------------+------------+-------+-----------+-------+---------+--------+---------+

(I'm showing the rows from the orignal table the result seems to be drawing from above, I understand the result of my query is not formatted like the above, I need the other parts and format of the query for later steps.)

Upvotes: 1

Views: 61

Answers (2)

Barbaros &#214;zhan
Barbaros &#214;zhan

Reputation: 65228

If your DB's version is 10.2+, then consider using

COUNT(colleaguecitrixid) OVER (PARTITION BY systemid, colleague, date) analytic function :

SELECT systemid,
       colleaguecitrixid,
       colleague,
       time,
       date,
       week,
       period,
       fincyear,
       username,
       mids,
       colleagues,
       ids 
  FROM
  (
   SELECT t.*,      
          COUNT(systemid) AS mids,
          COUNT(colleaguecitrixid) OVER (PARTITION BY systemid, colleague, date) AS colleagues,
          GROUP_CONCAT(DISTINCT id SEPARATOR ',') AS ids
     FROM table_name_2019 t
    WHERE period = '9' 
      AND calltype = 'Advice' 
      AND enabled = '1' 
      AND isarchived = '0' 
      AND DATE <> '$todaysdate'
  ) t1  
  WHERE colleagues > 1

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1269753

systemid, the colcitrixid and the date are all the same

The query should look like this:

SELECT systemid, colcitrixid, date, COUNT(*),
       GROUP_CONCAT(DISTINCT id SEPARATOR ',') AS ids
FROM table_name_2019
WHERE period = '9' AND
      calltype = 'Advice' AND
      enabled = '1' AND isarchived = '0' AND
      DATE <> CURDATE()
GROUP BY systemid, colcitrixid, date
HAVING COUNT(*) > 1;

I don't see colcitrixid in the data. If you really mean colleaguecitrixid, then use that.

Note:

  • In an aggregation query, all columns not in the GROUP BY should be arguments to aggregation columns.
  • MySQL has a function to return the date. There is no need to pass that value in under most circumstances.
  • The DISTINCT in GROUP_CONCAT() is probably unnecessary. A column called id should not have duplicates.

Upvotes: 1

Related Questions