Amazon CloudSearch accented words

I have an index whose documents contain accented words.

For example this document in Portuguese:

title => 'Ponte metálica'

If I search for "metálica" it matches, so no problem there. But people usually search without accents, so it's very common to search for just "metalica" (note the "a" without the accent "á").

That search returns no results. I tested it in the AWS console and via the /search endpoint. I'm using the 2013 API.

I don't think synonyms can solve this issue, since they aren't full words.

Upvotes: 4

Answers (1)

Fabio Manzano

Reputation: 2865

It looks like you posted the same question in AWS forums and got a reply:

The CloudSearch Portuguese stemmer does not remove accents, so á won't match a, and it does not currently have an option to remove them.

Two work-arounds I can think of:

Remove accents before uploading. (possibly to a different field)

Use a copy field, and the "multiple languages" analysis mode. This won't stem words by Portuguese rules, unfortunately, but it does remove accents!
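As a rough sketch of the first work-around (removing accents before uploading, into a separate field), something like the following could work. It uses only Python's standard unicodedata module. The field name title_folded and the helper names are my own assumptions for illustration, not anything CloudSearch defines; the document shape follows the 2013 batch format (type, id, fields).

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining accent marks removed (e.g. 'á' -> 'a')."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents and cedilla.
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever is left into the usual composed form.
    return unicodedata.normalize("NFC", kept)


def add_folded_field(doc: dict) -> dict:
    """Return a copy of a batch 'add' document with an accent-free title copy.

    'title_folded' is a hypothetical field name; it would have to be defined
    in the domain's indexing options before uploading.
    """
    fields = dict(doc["fields"])
    fields["title_folded"] = strip_accents(fields["title"])
    return {**doc, "fields": fields}


doc = {"type": "add", "id": "1", "fields": {"title": "Ponte metálica"}}
print(add_folded_field(doc)["fields"]["title_folded"])
```

If you go this way, you would presumably also run the user's query through strip_accents before searching the folded field, so that both "metálica" and "metalica" hit the same indexed value.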

I like the idea of removing the accent before upload, but I also have two other ideas:

Use fuzzy matching, so that you can tolerate one or maybe two "wrong" characters. This might come with a performance cost, so it's worth measuring.
Provide an auto-complete/suggester solution, similar to a "did you mean?" type of experience.

I found this Stack Overflow thread from around 2014 that discusses these two possibilities, still using CloudSearch: Implementing "Did you mean?" using Amazon CloudSearch

About the fuzzy matching operator:

You can also perform fuzzy searches with the simple query parser. To perform a fuzzy search, append the ~ operator and a value that indicates how much terms can differ from the user query string and still be considered a match. For example, specifying planit~1 searches for the term planit and allows matches to differ by up to one character, which means the results will include hits for planet.
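Applied to the question, a fuzzy query for metalica~1 should tolerate the single differing character in "metálica". Here is a minimal sketch that only builds the request URL for the 2013 search API; the endpoint host is a made-up placeholder and fuzzy_search_url is a hypothetical helper name, so substitute your own domain's search endpoint.

```python
from urllib.parse import urlencode


def fuzzy_search_url(endpoint: str, term: str, max_edits: int = 1) -> str:
    """Build a simple-parser search URL that allows max_edits differing characters."""
    params = {
        # The ~N suffix is the fuzzy operator described in the quote above.
        "q": f"{term}~{max_edits}",
        # "simple" is the parser the quoted documentation refers to.
        "q.parser": "simple",
    }
    return f"https://{endpoint}/2013-01-01/search?{urlencode(params)}"


# Placeholder host, not a real domain endpoint.
url = fuzzy_search_url("search-example.us-east-1.cloudsearch.amazonaws.com", "metalica")
print(url)
```

Note that a distance of 1 will also match unrelated words that happen to differ by one character, so the results may be noisier than with a dedicated accent-free field.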

And about auto-complete, with fuzzy matching option:

When you request suggestions, Amazon CloudSearch finds all of the documents whose values in the suggester field start with the specified query string—the beginning of the field must match the query string to be considered a match. The return data includes the field value and document ID for each match. You can configure suggesters to find matches for the exact query string, or to perform approximate string matching (fuzzy matching) to correct for typographical errors and misspellings.
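For the suggester route, a request could be built along these lines. This is only a sketch: the endpoint host and the suggester name title_suggester are assumptions, and the suggester (including its fuzzy matching level) would have to be configured on the domain beforehand. As far as I know, fuzzy matching is a property of the suggester's configuration rather than something passed per request.

```python
from urllib.parse import urlencode


def suggest_url(endpoint: str, prefix: str, suggester: str, size: int = 5) -> str:
    """Build a suggest URL for the 2013 API from a query prefix."""
    params = {
        "q": prefix,             # beginning of the field value to match
        "suggester": suggester,  # name of a suggester defined on the domain
        "size": size,            # maximum number of suggestions to return
    }
    return f"https://{endpoint}/2013-01-01/suggest?{urlencode(params)}"


# Placeholder host and hypothetical suggester name.
url = suggest_url("search-example.us-east-1.cloudsearch.amazonaws.com", "pont", "title_suggester")
print(url)
```

Keep in mind the point from the quote: suggestions match from the beginning of the field, so for a title like "Ponte metálica" the prefix "pont" can match, while "metalica" on its own would not.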

Upvotes: 1
