Reputation: 194
German language uses diacritical characters ä, ö, ü. For international use, they are translated into ae, oe, ue (not a, o, u). This mean, that Müller is Mueller on his ID document. This is what we get, when we read the document with (for example) passport reader and this is what we save to the database table.
In next step we search for the records. We do it in two ways:
With manual insert there is little problem, because user may enter data international way: 'Mueller' or popular way 'Müller'.
This problem can be solved by using postgres extension Unaccent and modification of unaccent.rules file, so despite is user inserts 'Mueller' or 'Müller', we search in the database for Mueller.
So far so good...
BUT
in the same table we have also other origin's names - for example Turkish ones. Turks translates theirs umlauts (ä, ö, ü) directly into a, o, u, and this way they are saved on the documents, so Müller would by Muller on Turkish document. This causes a problem because (as described before) we search with German unnaccent.rules and we don't find people who we search for.
Long story, but finally question...
Select * from table where last_name = unaccent('Müller' (use German rules)) or last_name = unaccent('Müller' (use Turkish rules))
(I know that what's above does not work, but maybe there is something similar we could use)
regards M
Upvotes: 1
Views: 423
Reputation: 246073
The solution should be simple. Define your German unaccent dictionary (I'll call it entumlauten
), then query like
SELECT ...,
last_name = unaccent('unaccent', 'Müller') AS might_be_turkish,
last_name = unaccent('entumlauten', 'Müller') AS might_be_german,
FROM tab
WHERE last_name IN (unaccent('unaccent', 'Müller'),
unaccent('entumlauten', 'Müller'))
IN
(or (= ANY
) will perform better than OR
, because it can use an index scan. The additional columns in the SELECT
list tell you which condition was matched.
Upvotes: 2
Reputation: 1107
Use soundex()
function. This is suitable only for creating lists for human user to pick wanted name. You probably should clean all diacritics (use the Turkish way) before using this.
It also handles similar sounding letters, like C, S and Z or D and T. So Schmidt
would match Smith
or Jönssen
matches Johnson
.
Upvotes: 0