BenMorel

Reputation: 36524

How to make MediaWiki search ignore accents?

I'm running a MediaWiki instance that I just upgraded to the latest version at the time of this writing, 1.32.0. This wiki is nearly 10 years old and has gone through a number of upgrades.

It's a French-language wiki, and something that has annoyed French speakers version after version is that the built-in search treats accented characters as different from their non-accented counterparts.

For example, searching for Aromathérapie returns a number of results, while searching for Aromatherapie returns 0 results.

I thought that this was a database collation issue at first, until I noticed that the searchindex table is actually populated with ASCII-encoded UTF-8 words. Taking the example above, aromathérapie is stored as aromathu8c3a9rapie, so changing the table collation does not help.
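To make the stored form concrete, here is a minimal Python sketch of the encoding as described above: lowercase the text, then replace each non-ASCII character with u8 followed by the hex of its UTF-8 bytes. The function name encode_for_index is hypothetical, and this only mimics the observed output; it is not MediaWiki's actual code.

```python
import re


def encode_for_index(text: str) -> str:
    """Mimic the observed searchindex form (hypothetical helper, not MediaWiki code)."""

    def to_hex(match: re.Match) -> str:
        # "u8" prefix followed by the hex of the character's UTF-8 bytes
        return "u8" + match.group(0).encode("utf-8").hex()

    # Lowercase first, then rewrite every non-ASCII character
    return re.sub(r"[^\x00-\x7f]", to_hex, text.lower())


# \u00e9 is the precomposed "e with acute accent"
print(encode_for_index("Aromath\u00e9rapie"))  # aromathu8c3a9rapie
print(encode_for_index("Aromatherapie"))       # aromatherapie
```

The two outputs differ as plain ASCII strings, which is why no collation applied to the table can make them compare equal.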

Digging through the source code, I found the SearchMySQL::normalizeText() method that is responsible for this encoding.

And as far as I can see, the only normalization that this method does prior to encoding is lowercasing:

MediaWikiServices::getInstance()->getContentLanguage()->lc( $out )

So as it stands, it looks like there is no way to make the built-in search ignore accents.

I googled quite a lot for solutions and found mostly old, irrelevant threads. I'm really surprised not to find more literature on the subject.

How can I make the MediaWiki search both case- and accent-insensitive?

Upvotes: 4

Views: 718

Answers (3)

BenMorel

Reputation: 36524

I'm not proud of it, but here's how I solved it, using MySQL's built-in support for collations (which does work with fulltext indexes, at least in recent versions of MySQL, contrary to what the code says):

  • Converted the searchindex table to utf8mb4:
    ALTER TABLE searchindex CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
  • Applied this patch to includes/search/SearchMySQL.php:
    • no lowercasing, no replacing of UTF-8 chars with their hex-encoded counterpart
    • unicode u flag in preg_replace()
  • Rebuilt the searchindex table: php maintenance/rebuildtextindex.php

A similar procedure will have to be applied whenever the MediaWiki installation is updated, which adds to the maintenance cost. The procedure being simple, it's a cost I'm willing to accept right now.

A final note is that this does not make the autocompletion work case-insensitively, only the search results. This is good enough for me for now.
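As an illustration of why storing the raw text under an accent- and case-insensitive collation helps, the following Python sketch approximates that kind of comparison by stripping combining marks and case-folding. This is only an approximation for illustration: MySQL's utf8mb4_unicode_ci compares strings using Unicode collation weights rather than this procedure, and the helper name fold is hypothetical.

```python
import unicodedata


def fold(text: str) -> str:
    """Approximate an accent- and case-insensitive key (hypothetical helper)."""
    # Decompose characters so accents become separate combining marks
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping the base letters
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold so that upper and lower case compare equal
    return stripped.casefold()


# \u00e9 is the precomposed "e with acute accent"
print(fold("Aromath\u00e9rapie") == fold("aromatherapie"))  # True
```

With the hex encoding removed, both spellings reduce to the same key under such a comparison, which is what lets the fulltext search match them.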

Upvotes: 3

anand_v.singh

Reputation: 2838

Let's tackle each problem one at a time.

First, let's handle the smaller problem, case sensitivity:

select * from tableName where lower(col_name) = lower(searchTerm);

or

select * from tableName where upper(col_name) = upper(searchTerm);

Part 2, handling the encoding: as suggested by others, you can install a more capable search tool, or you can change how your search term is represented by converting searchTerm to %s%e%a%r%c%h%T%e%r%m%. This adds wildcards that can skip over extra characters added by the UTF-8 encoding. The advantage of this approach is that it requires minimal changes to your existing code, but it slightly increases computation and complexity.
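A minimal Python sketch of that conversion follows. The helper names are hypothetical; like_matches is a simplified, case-sensitive stand-in for how a pattern made only of % wildcards behaves, not a full SQL LIKE implementation. Escaping of literal % or _ in the term is not handled here.

```python
def to_wildcard_pattern(term: str) -> str:
    """Put a % wildcard before, between and after every character of the term."""
    return "%" + "%".join(term) + "%"


def like_matches(term: str, stored: str) -> bool:
    """Simplified model of matching the %-interleaved pattern against a stored value.

    Such a pattern matches when the term's characters appear in the stored
    value in the same order, with anything in between (a subsequence test).
    """
    remaining = iter(stored)
    return all(ch in remaining for ch in term)


print(to_wildcard_pattern("searchTerm"))  # %s%e%a%r%c%h%T%e%r%m%
```

Note that the wildcards only tolerate extra characters. Under this simplified model, like_matches("aromatherapie", "aromathu8c3a9rapie") is False, because the stored form has no e directly available after the h; the accented letter was replaced, not padded. It is therefore worth testing the pattern against real rows before relying on it.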

This was written in the context of SQL. If you are using a different database management system, the queries may vary slightly, but the idea remains the same.

That should get the job done. If you have any questions, feel free to add comments.

Upvotes: -1

S.Spieker

Reputation: 7365

If you do not want CirrusSearch, you could try a lightweight extension: TitleKey

Installation

  • Download and place the file(s) in a directory called TitleKey in your extensions/ folder.
  • Add the following code at the bottom of your LocalSettings.php:

    wfLoadExtension( 'TitleKey' );
    
  • Run the update script which will automatically create the necessary database tables that this extension needs.

  • Run the rebuildTitleKeys.php script (this requires command-line access):

    php extensions/TitleKey/rebuildTitleKeys.php
    
  • Done – Navigate to Special:Version on your wiki to verify that the extension is successfully installed.

Upvotes: 0
