Reputation: 8158
I have a table with contents that look similar to this:
id | title
------------
1 | 5. foo
2 | 5.foo
3 | 5. foo*
4 | bar
5 | bar*
6 | baz
6 | BAZ
…and so on. I would like to group by the titles and ignore the extra bits. I know Postgres can do this:
SELECT * FROM (
SELECT regexp_replace(title, '[*.]+$', '') AS title
FROM table
) AS a
GROUP BY title
However, that's quite simple and would get very unwieldy if I tried to anticipate all the possible variations. So, the question is, is there a more general way to do fuzzy grouping than using regexp? Is it even possible, at least without breaking one's back doing it?
Edit: To clarify, there is no preference for any of the variations, and this is what the table should look like after grouping:
title
------
5. foo
bar
baz
I.e., the variations would be items that are different just by a few characters or capitalization, and it doesn't matter which ones are left as long as they're grouped.
Upvotes: 2
Views: 1530
Reputation: 425683
For any grouping you should have transitive equality, that is a ~= b, b ~= c => a ~= c
.
Formulate it strictly using words and we'll try to formulate it using SQL
.
For instance, which group should foo*bar
go to?
Update:
This query replaces all non-alphanumerical characters with spaces and returns first title from each group:
SELECT DISTINCT ON (REGEXP_REPLACE(UPPER(title), '[^[:alnum:]]', '', 'g')) title
FROM (
VALUES
(1, '5. foo'),
(2, '5.foo'),
(3, '5. foo*'),
(4, 'bar'),
(5, 'bar*'),
(6, 'baz'),
(7, 'BAZ')
) rows (id, title)
Upvotes: 3
Reputation: 754570
At some time, you are going to have to define what makes a set of values belong together in a group. If that's too hard, maybe you should prohibit and inhibit the entry of fuzzy data, or if you must permit it, add a column that contains a sanitized version of the title for use by the grouping operations.
Upvotes: 2