Is this a bug in the GAE Search API?

Question

I'm implementing a full text search based on the song database on GuitarParty.com. The data consists of lyrics in multiple languages, which is not a problem per se.

However, when search results are returned using snippeted_fields, all accented characters within words, such as ÚúÉéÍí, come back as their unaccented equivalents, UuEeIi.

This is how I form my query:

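    from google.appengine.api import search

    # qs (the user's query string) and cursor are defined elsewhere.
    query = search.Query(
        query_string=qs,
        options=search.QueryOptions(
            sort_options=search.SortOptions(
                # Alternative scorer: match_scorer=search.MatchScorer(),
                match_scorer=search.RescoringMatchScorer(),
                expressions=[
                    # Boost the relevance score by the document's importance field.
                    search.SortExpression(expression='_score + importance * 0.03', default_value=0)
                    # Alternative: search.SortExpression(expression='_score', default_value=0)
                ],
                limit=1000,
            ),
            cursor=cursor,
            # Fields returned verbatim (accents intact).
            returned_fields=['title', 'atomtitle', 'item', 'image'],
            # Fields returned as snippets (accents lost).
            snippeted_fields=['title', 'atomtitle', 'body', 'item'],
        )
    )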

I'm pretty sure this is not an encoding issue, since everything looks just right if I pull my document fields directly (as I do with the titles). It's only the snippeted expressions that display incorrectly.

To better see what I'm referring to you can take my test engine for a spin here: http://gp-search.appspot.com/ and search for something Icelandic. Example phrase: Vísur vatnsenda Rósu

This will return a document with this snippet:

Augun min og augun þin. O þa fogru steina. Mitt er þitt og þitt er mitt, þu veist hvað eg mei- na. Langt er siðan sa eg hann sannlega friður var hann.

The correctly spelled snippet should be:

Augun mín og augun þín. Ó þá fögru steina. Mitt er þitt og þitt er mitt, þú veist hvað eg mei- na. Langt er síðan sá ég hann sannlega friður var hann.

Am I better off generating my own snippet from the document data, or is there something I can do to pull snippets with accented characters within words?
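For reference, this is roughly what I mean by generating my own snippet. It is only a minimal sketch and does not use the Search API at all: `fold` and `make_snippet` are hypothetical helper names, and it assumes the full text of the field (for example `body`, which I would then have to add to returned_fields or load separately) is available as a unicode string. Matching is done on an accent-stripped copy of the text, but the window is cut from the original, so the accents survive.

```python
# -*- coding: utf-8 -*-
import unicodedata


def fold(text):
    """Lowercase text and drop accents, keeping one character per input character."""
    # NFD splits e.g. u'\xe9' into 'e' + a combining accent; keep only the base character.
    return u''.join(unicodedata.normalize('NFD', ch)[0].lower()[0] for ch in text)


def make_snippet(text, query, width=40, ellipsis=u'...'):
    """Cut a window of the original (accented) text around the earliest matching term."""
    # Compose first so that every accented letter is a single character.
    text = unicodedata.normalize('NFC', text)
    folded = fold(text)  # same length as text, so indexes line up

    positions = []
    for term in query.split():
        pos = folded.find(fold(unicodedata.normalize('NFC', term)))
        if pos >= 0:
            positions.append(pos)

    # Fall back to the start of the text when no term matches.
    first = min(positions) if positions else 0
    start = max(first - width, 0)
    end = min(first + width, len(text))

    snippet = text[start:end]
    if start > 0:
        snippet = ellipsis + snippet
    if end < len(text):
        snippet = snippet + ellipsis
    return snippet
```

Called as `make_snippet(body_text, qs)`, this would return a slice of the original text, so a search for "fogru" should still show "fögru" in the snippet. It does no highlighting and only looks at the earliest matching term.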

Is this a bug in the GAE Search API?

Answers (1)