Reputation: 79
I use PostgreSQL with unaccent extension in order to handle German umlauts.
By default, the mapping works in the following way:
Müller -> Muller. But this is not natural for German: ü should be mapped to ue and ö to oe, whereas by default ö is mapped to o.
To handle this, I edited the unaccent.rules file in /usr/local/share/postgresql/tsearch_data, as suggested by some German colleagues who had stumbled upon this problem as well.
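For reference, the relevant entries in such a custom unaccent.rules file might look like this (each line is a character followed by its replacement, separated by whitespace — the exact file contents here are a sketch, not my production file):

```
ä	ae
ö	oe
ü	ue
Ä	AE
Ö	OE
Ü	UE
ß	ss
```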
So, essentially, on container initialization I mount a custom rules file that replaces the original one before the extension is created, and, voilà, I have a working solution! All I have to do is wrap the column name like this:
SELECT * FROM students
WHERE unaccent(name) ILIKE unaccent('Muel%')
This works on my local machine and in every environment where the database lives inside a Docker container, but I'm not sure how to make it work on Amazon RDS, our production environment, since I don't have access to the file system there.
Upvotes: 0
Views: 153
Reputation: 247800
You won't be able to create a custom “unaccent” dictionary in a hosted database, since that requires file system access.
I recommend that you use a custom function like
CREATE FUNCTION normalized(text) RETURNS text
   IMMUTABLE PARALLEL SAFE
RETURN unaccent(
          replace(
             replace(
                replace(
                   replace(
                      lower($1), 'ö', 'oe'
                   ), 'ü', 'ue'
                ), 'ä', 'ae'
             ), 'ß', 'ss'
          )
       );
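To see what this function computes, here is a Python sketch of the same normalization; the unaccent() step is approximated with Unicode decomposition, which matches unaccent's default behavior for common Latin diacritics (an approximation, since the real extension is driven by its rules file):

```python
import unicodedata

def normalized(text: str) -> str:
    # Mirror the SQL function: lowercase first, then expand the
    # German characters that unaccent's default rules would flatten.
    s = text.lower()
    for src, repl in (('ö', 'oe'), ('ü', 'ue'), ('ä', 'ae'), ('ß', 'ss')):
        s = s.replace(src, repl)
    # Approximate unaccent(): decompose to NFKD and drop combining marks,
    # so remaining accented characters lose their diacritics.
    decomposed = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))
```

For example, normalized('Müller') yields 'mueller' and normalized('Straße') yields 'strasse', which is exactly the form both the indexed documents and the search strings are reduced to.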
Then you could create a text search index with
CREATE INDEX ON tab USING gin (to_tsvector('german', normalized(doc)));
You would apply the same normalization to documents and search strings:
SELECT ... FROM tab
WHERE to_tsvector('german', normalized(doc)) @@
to_tsquery('german', normalized('Schöffen'));
Upvotes: 1