Aaron Drenberg

Reputation: 1117

App Engine Search API (Document Search) - Multiple Languages

I have Documents that I'd like to make searchable in 3 different languages. Since I can have multiple fields with the same name/type, the following Document structure works (this is a simplified example).

document = search.Document(
    fields=[
      search.TextField(
        name="name",
        language="en",
        value="dog"),
      search.TextField(
        name="name",
        language="es",
        value="perro"),
      search.TextField(
        name="name",
        language="fr",
        value="chien")
    ]
)
index = search.Index("my_index")
index.put(document)

Specifying the language helps Google tokenize the value of the TextField.

The following queries all work, each returning one result:

print index.search("name: dog")
print index.search("name: perro")
print index.search("name: chien")

Here is my question: Can I restrict a search to only target fields with a specific language?

The purpose is to avoid false positives. Since all three languages use the Latin alphabet, someone performing a full-text search in Spanish may see English results that are not relevant.

Thank you.

Upvotes: 6

Views: 320

Answers (2)

Frank Wilson

Reputation: 3250

You could use a separate index for each language.

Define a utility function for resolving the correct index for a given language:

def get_index(lang):
    return search.Index("my_index_{}".format(lang))
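
If the language code comes from user input, it may be worth validating it before it becomes part of an index name. The following is only a sketch: `SUPPORTED_LANGUAGES` and `index_name_for` are hypothetical names, and the list of languages is assumed from the question.

```python
# Assumption: the three languages mentioned in the question.
SUPPORTED_LANGUAGES = ("en", "es", "fr")


def index_name_for(lang, prefix="my_index"):
    """Return the per-language index name, e.g. 'my_index_fr'."""
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError("unsupported language: {!r}".format(lang))
    return "{}_{}".format(prefix, lang)
```

get_index could then return search.Index(index_name_for(lang)), so a typo in the language code fails loudly instead of silently creating a new, empty index.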

Insert documents:

document = search.Document(
    fields=[
      search.TextField(
        name="name",
        language="en",
        value="dog"),
    ])

get_index('en').put(document)

document = search.Document(
    fields=[
      search.TextField(
        name="name",
        language="fr",
        value="chien")
    ])

get_index('fr').put(document)

Query by language:

query = search.Query(
    'name: chien')

results = get_index('fr').search(query)

for doc in results:
    print doc

Upvotes: 2

Frank Wilson

Reputation: 3250

You can use facets to attach metadata to a document, that is, values that don't appear in any of its fields. Here the facets indicate which languages appear in the document.

Document insertion:

    index = search.Index("my_index")
    document = search.Document(
        fields=[
          search.TextField(
            name="name",
            language="en",
            value="dog"),
          search.TextField(
            name="name",
            language="es",
            value="perro"),
          search.TextField(
            name="name",
            language="fr",
            value="chien")
        ],
        facets=[
           search.AtomFacet(name='lang', value='en'),
           search.AtomFacet(name='lang', value='es'),
           search.AtomFacet(name='lang', value='fr'),
        ],
      )
    index.put(document)
    document = search.Document(
        fields=[
          search.TextField(
            name="name",
            language="es",
            value="gato"),
          search.TextField(
            name="name",
            language="fr",
            value="chat")
        ],
        facets=[
           # no English in this document, so leave out lang='en'
           search.AtomFacet(name='lang', value='es'),
           search.AtomFacet(name='lang', value='fr'),
        ],
      )
    index.put(document)

Query:

index = search.Index("my_index")
query = search.Query(
    '', # query all documents, cats and dogs.
    # filter docs by language facet
    facet_refinements=[
        search.FacetRefinement('lang', value='en'),
    ])

results = index.search(query)
for doc in results:
    result = {}
    for f in doc.fields:
        # filter fields by language
        if f.language == 'en':
            result[f.name] = f.value
    print result

This should print {u'name': u'dog'}.

Note that although we can fetch only the documents that contain English, we still have to filter out the fields in other languages within those documents. This is why we iterate through the fields, adding only those in English to result.
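
The per-field filtering can be pulled out into a small helper. This is a minimal sketch, not part of the Search API: `fields_for_language` is a hypothetical name, and `Field` is a stand-in namedtuple with the same `name`, `language` and `value` attributes as the fields above, used so the example runs without the App Engine SDK.

```python
from collections import namedtuple

# Stand-in for a Search API field (assumption: only these three
# attributes are needed for the filtering step).
Field = namedtuple("Field", ["name", "language", "value"])


def fields_for_language(fields, lang):
    """Return {name: value} for the fields whose language equals lang."""
    return {f.name: f.value for f in fields if f.language == lang}


fields = [
    Field("name", "en", "dog"),
    Field("name", "es", "perro"),
    Field("name", "fr", "chien"),
]
english = fields_for_language(fields, "en")
```

In the query loop above, the inner for loop would become result = fields_for_language(doc.fields, 'en').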

If you want to know more about the more general use case for faceted search, this answer gives a pretty good idea.

Upvotes: 2
