Reputation: 1117
I have Documents that I'd like to make searchable in 3 different languages. Since I can have multiple fields with the same name/type, the following Document structure works (this is a simplified example).
from google.appengine.api import search
document = search.Document(
    fields=[
        search.TextField(
            name="name",
            language="en",
            value="dog"),
        search.TextField(
            name="name",
            language="es",
            value="perro"),
        search.TextField(
            name="name",
            language="fr",
            value="chien")
    ]
)
index = search.Index("my_index")
index.put(document)
Specifying the language helps Google tokenize the value of the TextField.
The following queries all work, each returning one result:
print index.search("name: dog")
print index.search("name: perro")
print index.search("name: chien")
Here is my question: Can I restrict a search to only target fields with a specific language?
The purpose is to avoid false positives. Since all three languages use the Latin alphabet, someone performing a full-text search in Spanish may see English results that are not relevant.
Thank you.
Upvotes: 6
Views: 320
Reputation: 3250
You could use a separate index for each language.
Define a utility function for resolving the correct index for a given language:
def get_index(lang):
    return search.Index("my_index_{}".format(lang))
Insert documents:
document = search.Document(
    fields=[
        search.TextField(
            name="name",
            language="en",
            value="dog"),
    ])
get_index('en').put(document)
document = search.Document(
    fields=[
        search.TextField(
            name="name",
            language="fr",
            value="chien")
    ])
get_index('fr').put(document)
Query by language:
query = search.Query('name: chien')
results = get_index('fr').search(query)
for doc in results:
    print doc
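If the language code comes from user input, it may be worth validating it before building the index name. Below is a minimal sketch, assuming only the three languages from the question are supported. `SUPPORTED_LANGUAGES` and `index_name_for` are hypothetical names, not part of the Search API.

```python
# Assumption: the three languages mentioned in the question.
SUPPORTED_LANGUAGES = ("en", "es", "fr")


def index_name_for(lang, prefix="my_index"):
    """Return the per-language index name, e.g. 'my_index_fr'."""
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError("unsupported language: {!r}".format(lang))
    return "{}_{}".format(prefix, lang)
```

get_index could then return search.Index(index_name_for(lang)), so an unexpected code fails early instead of silently creating a new index.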
Upvotes: 2
Reputation: 3250
You can use facets to add fields to a document that don't actually appear in the document (metadata). These would indicate what languages appear in the document.
Document insertion:
index = search.Index("my_index")
document = search.Document(
    fields=[
        search.TextField(
            name="name",
            language="en",
            value="dog"),
        search.TextField(
            name="name",
            language="es",
            value="perro"),
        search.TextField(
            name="name",
            language="fr",
            value="chien")
    ],
    facets=[
        search.AtomFacet(name='lang', value='en'),
        search.AtomFacet(name='lang', value='es'),
        search.AtomFacet(name='lang', value='fr'),
    ],
)
index.put(document)
document = search.Document(
    fields=[
        search.TextField(
            name="name",
            language="es",
            value="gato"),
        search.TextField(
            name="name",
            language="fr",
            value="chat")
    ],
    facets=[
        # no English in this document, so leave out lang='en'
        search.AtomFacet(name='lang', value='es'),
        search.AtomFacet(name='lang', value='fr'),
    ],
)
index.put(document)
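Writing the facet list by hand can drift out of sync with the fields. One option is to derive the language codes from the fields themselves. This is a rough sketch: `Field` is a hypothetical stand-in for search.TextField so the snippet runs without the App Engine SDK, and `languages_present` is an illustrative helper, not part of the API.

```python
from collections import namedtuple

# Hypothetical stand-in for search.TextField (only the attributes used here).
Field = namedtuple("Field", ["name", "language", "value"])


def languages_present(fields):
    """Return the sorted, de-duplicated language codes used by the fields."""
    return sorted({f.language for f in fields if f.language})


cat_fields = [
    Field(name="name", language="es", value="gato"),
    Field(name="name", language="fr", value="chat"),
]
cat_languages = languages_present(cat_fields)
```

Each code in the returned list would then be wrapped as search.AtomFacet(name='lang', value=code) when building the document.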
Query:
index = search.Index("my_index")
query = search.Query(
    '',  # query all documents, cats and dogs.
    # filter docs by language facet
    facet_refinements=[
        search.FacetRefinement('lang', value='en'),
    ])
results = index.search(query)
for doc in results:
    result = {}
    for f in doc.fields:
        # filter fields by language
        if f.language == 'en':
            result[f.name] = f.value
    print result
This should print {u'name': u'dog'}.
Note that although we can fetch only the documents that contain English, we still have to filter out the fields in other languages within those documents. This is why we iterate through the fields, adding only those in English to result.
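The per-field filtering can be pulled out into a small helper. This is only a sketch: `Field` is a hypothetical stand-in for the field objects found in doc.fields, and `fields_for_language` is an illustrative name.

```python
from collections import namedtuple

# Hypothetical stand-in for the field objects in doc.fields.
Field = namedtuple("Field", ["name", "language", "value"])


def fields_for_language(fields, lang):
    """Map field name to value, keeping only fields tagged with lang."""
    return {f.name: f.value for f in fields if f.language == lang}


dog_fields = [
    Field(name="name", language="en", value="dog"),
    Field(name="name", language="es", value="perro"),
    Field(name="name", language="fr", value="chien"),
]
english_only = fields_for_language(dog_fields, "en")
```

If a document has several fields with the same name and language, the last one wins in this sketch, which matches the behavior of the loop above.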
If you want to know more about the more general use case for faceted search, this answer gives a pretty good idea.
Upvotes: 2