Kevin Orriss
Kevin Orriss

Reputation: 1032

How to get distinct values in one column with multiple possible values in the other?

I'm trying to work out if this is possible, let me give an example. Would be awesome if you could guide me in the right direction please.

Table = names
--------------------
Marks & Spencer
Marks & Spencer
marks & spencer

What I am trying to do is to return distinct values where I have converted all & signs and changed to upper case.

So my query is:

SELECT regexp_replace(UPPER(name), '&(amp;)*|\\+', '&', 'gi') AS name FROM names GROUP BY names;

However, what I would like to do is also return one of the original values, it does not matter which one, but I only want 1 row to be returned, like

Result
----------------
name            original
------------------------
MARKS&SPENCER   Marks & Spencer

Is this possible? Because at the moment, what I get returned is this:

Result
----------------
name            original
------------------------
MARKS&SPENCER   Marks & Spencer
MARKS&SPENCER   Marks & Spencer
MARKS&SPENCER   marks & spencer

Thank you for reading, would really appreciate the help.

==========

EDIT

The query I am using to get the above result is:

SELECT names.name, T.result FROM names 
INNER JOIN 
(
    SELECT DISTINCT regexp_replace(UPPER(name), '&(amp;)*|\\+', '&', 'gi') AS result FROM names 
) AS T 
ON regexp_replace(UPPER(name), '&(amp;)*|\\+', '&', 'gi')=T.result 
GROUP BY T.result, names.name 
ORDER BY T.result ASC

I am using PostgreSQL btw, which can do more than MySQL incase that changes things?

Upvotes: 3

Views: 971

Answers (3)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 659017

There is still room for improvement:

SELECT regexp_replace(upper(name), E'&(?:AMP;)+|\\+', '&', 'g') AS name
     , min(name) AS min_org_name
--   , string_agg(name) AS org_names  -- if you want a list of originals
--   , array_to_string(array_agg(name), ', ') AS org_names -- for pg < 9.0+
     , count(*) AS ct
FROM  (   
    SELECT *
    FROM  (VALUES
          ('Marks & Spencer')
        , ('Marks &amp; Spencer')
        , ('marks &amp; spencer')
        , ('marks &amp; speNceR + sons')
        , ('marks &amp;amp;AMP; speNceR & sons')
       ) AS names(name)
    ) name
GROUP  BY 1;

Major points

  • Improve regexp:
    • replace &amp;(amp;)* with identical &(amp;)+
    • after use of upper() on the original, the 'i' flag only slows execution. Rather upper case pattern, too: &(AMP;)+
    • Use non-capturing parenthesis: (?:)
    • As you use a escape sequence \\+, use proper syntax E''
  • Simplify GROUP BY with positional parameter, no need to spell it out twice

Upvotes: 1

MatBailie
MatBailie

Reputation: 86798

At present you're grouping by the original field (you can't group by a field in your select).

Do you want one of these?

SELECT DISTINCT
  name                                                       AS original,
  regexp_replace(UPPER(name), '&amp;(amp;)*|\\+', '&', 'gi') AS name
FROM
  names

Or...

SELECT
  name                                                       AS original,
  regexp_replace(UPPER(name), '&amp;(amp;)*|\\+', '&', 'gi') AS name
FROM
  names
GROUP BY
  name,
  regexp_replace(UPPER(name), '&amp;(amp;)*|\\+', '&', 'gi')

Or...

SELECT
  original,
  name
FROM
(
  SELECT
    name                                                       AS original,
    regexp_replace(UPPER(name), '&amp;(amp;)*|\\+', '&', 'gi') AS name
  FROM
    names
)
  AS clean_data
GROUP BY
  original,
  name

Upvotes: 0

John Doyle
John Doyle

Reputation: 7803

You need to group by the new name to get only one row and, as you don't care which original name appears, aggregate it with something like min:

SELECT min(name),regexp_replace(UPPER(name), '&amp;(amp;)*|\\+', '&', 'gi') AS name
FROM names
GROUP BY regexp_replace(UPPER(name), '&amp;(amp;)*|\\+', '&', 'gi')

Upvotes: 1

Related Questions