Fuzzy grouping in Postgres

Question

I have a table with contents that look similar to this:

id | title
------------
1  | 5. foo
2  | 5.foo
3  | 5. foo*
4  | bar
5  | bar*
6  | baz
6  | BAZ

…and so on. I would like to group by the titles and ignore the extra bits. I know Postgres can do this:

SELECT * FROM (
  SELECT regexp_replace(title, '[*.]+$', '') AS title
  FROM table
) AS a
GROUP BY title

However, that's quite simple and would get very unwieldy if I tried to anticipate all the possible variations. So, the question is, is there a more general way to do fuzzy grouping than using regexp? Is it even possible, at least without breaking one's back doing it?

Edit: To clarify, there is no preference for any of the variations, and this is what the table should look like after grouping:

title
------
5. foo
bar
baz

I.e., the variations would be items that are different just by a few characters or capitalization, and it doesn't matter which ones are left as long as they're grouped.

Quassnoi · Accepted Answer

For any grouping you should have transitive equality, that is a ~= b, b ~= c => a ~= c.

Formulate it strictly using words and we'll try to formulate it using SQL.

For instance, which group should foo*bar go to?

Update:

This query replaces all non-alphanumerical characters with spaces and returns first title from each group:

SELECT  DISTINCT ON (REGEXP_REPLACE(UPPER(title), '[^[:alnum:]]', '', 'g')) title
FROM    (
        VALUES
        (1, '5. foo'),
        (2, '5.foo'),
        (3, '5. foo*'),
        (4, 'bar'),
        (5, 'bar*'),
        (6, 'baz'),
        (7, 'BAZ')
        ) rows (id, title)

Fuzzy grouping in Postgres

Answers (2)

Related Questions