Reputation: 123
I'm implementing a full text search based on the song database on GuitarParty.com. The data consists of lyrics in multiple languages, which is not a problem per se.
However, when search results are returned using snippeted_fields, all accented characters within words, such as ÚúÉéÍí, come back as their unaccented equivalents, UuEeIi.
This is how I form my query:
query = search.Query(
    query_string=qs,
    options=search.QueryOptions(
        sort_options=search.SortOptions(
            #match_scorer=search.MatchScorer(),
            match_scorer=search.RescoringMatchScorer(),
            expressions=[
                search.SortExpression(expression='_score + importance * 0.03', default_value=0)
                #search.SortExpression(expression='_score', default_value=0)
            ],
            limit=1000,
        ),
        cursor=cursor,
        returned_fields=['title', 'atomtitle', 'item', 'image'],
        snippeted_fields=['title', 'atomtitle', 'body', 'item'],
    )
)
I'm pretty sure this is not an encoding issue, since everything looks right if I pull my document fields directly (as I do with the titles). Only the snippeted expressions display incorrectly.
To better see what I'm referring to you can take my test engine for a spin here: http://gp-search.appspot.com/ and search for something Icelandic. Example phrase: Vísur vatnsenda Rósu
This will return a document with this snippet:
Augun min og augun þin. O þa fogru steina. Mitt er þitt og þitt er mitt, þu veist hvað eg mei- na. Langt er siðan sa eg hann sannlega friður var hann.
Correctly spelled snippet should be:
Augun mín og augun þín. Ó þá fögru steina. Mitt er þitt og þitt er mitt, þú veist hvað eg mei- na. Langt er síðan sá ég hann sannlega friður var hann.
Am I better off generating my own snippet from the document data, or is there something I can do to pull snippets with accented characters within words?
Upvotes: 0
Views: 248
Reputation: 19864
The data you put in gets normalized so that you don't have to worry about accents or missing accents when searching it.
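If you do end up generating your own snippet from the document data, something along these lines might work. This is only a rough sketch, not anything the Search API provides: `fold` and `make_snippet` are hypothetical helper names, the `radius` parameter is made up for illustration, and the accent folding here (strip Unicode combining marks) only approximates whatever normalization the index applies. It assumes the body and the query are Unicode strings (`str` on Python 3).

```python
import unicodedata


def fold(text):
    """Remove combining marks, so that 'mín' compares equal to 'min'.

    Letters such as 'þ' and 'ð' have no decomposition and are kept as-is.
    """
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


def make_snippet(body, query, radius=5):
    """Return a window of original (accented) words around the first match.

    Matching is done on folded, lower-cased words, but the snippet itself
    is assembled from the untouched words of the body.
    """
    words = body.split()
    terms = set(fold(term).lower() for term in query.split())
    for i, word in enumerate(words):
        # Ignore punctuation attached to the word when comparing.
        key = fold(word).lower().strip('.,;:!?')
        if key in terms:
            start = max(0, i - radius)
            return ' '.join(words[start:i + radius + 1])
    # No term matched: fall back to the beginning of the body.
    return ' '.join(words[:2 * radius + 1])


body = 'Augun mín og augun þín. Ó þá fögru steina. Mitt er þitt og þitt er mitt'
print(make_snippet(body, 'fogru', radius=2))  # Ó þá fögru steina. Mitt
```

You would pull the full `body` field from the returned document and pass it through `make_snippet` together with the user's query string. It only looks for single words, so phrase queries or query operators would need more work.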
Upvotes: 1