Nate
Nate

Reputation: 28454

How to group rows where one or more columns are equal?

Let's say I have a database of products that have the following columns:

name
isbn
ean
upc
sku

And I want to group products together if any of those columns match. How can I do that?

I tried using the GROUP BY clause, but apparently that means that all of the columns have to match, not just a single one.

In other words, GROUP BY name, isbn, ean, upc, sku only groups products together when all of the columns are equal, where I'm wanting to group them together if any of the columns are equal.

Example:

id  |  name   |     isbn     |  ean  |    upc     | sku
 0    banana1       null        null   0000000000   1234
 1    banana2       6789        null   0000000000   1234 
 2    banana3       6789        null   1111111111   null
 3    banana4       null        null   1111111111   null

All of those rows should be grouped together because they have at least one matching column with respect to each other.

Upvotes: 0

Views: 137

Answers (2)

Gordon Linoff
Gordon Linoff

Reputation: 1271111

This is a graph traversal problem and painful to implement in SQL -- and especially MySQL which doesn't have support for hierarchical queries.

There is an iterative solution that is not too bad. The idea is to add a column "groupid" and then continually update. The rule for the update is to initially populate it with the id and then assign it the minimum along any of the dimensions. Here is an example of code:

update products
    set groupid = id;

/* now run this multiple times until there are no more updates */
update products p
    set groupid = (select min(groupid)
                   from products p2
                   where p.name = p2.name or
                         p.isbn = p2.isbn or
                         p.ean = p2.ean or
                         p.upc = p2.upc or
                         p.sku = p2.sku
                 );

Upvotes: 1

Palpatim
Palpatim

Reputation: 9272

Nope. GROUP BY requires that all elements in the grouped expression are identical amongst rows in that group. The best I can think of is:

SELECT
  *
FROM
  table
WHERE
  name = isbn 
  OR name = ean
  OR name = upc
  OR name = sku
  OR isbn = ean
  OR isbn = upc
  OR isbn = sku
  OR ean = upc
  OR ean = sku
  OR upc = sku

The downside, as you can see, is that this is phenomenally ugly, and doesn't scale well when adding a new column. I would only use this if it was a one-off script--for example, to identify data corruption.

Upvotes: 2

Related Questions