Reputation: 627
I have an archive of university theses and publications indexed (with BM25 similarity) in Lucene (Java version). I have English and Italian documents, so I have duplicated fields such as pdf and pdf_en, or titolo and titolo_en. For an Italian document I fill the Italian fields; otherwise I fill the English fields.
Now I have a BooleanQuery built with MultiFieldQueryParser. This is my code:
String[] fieldsGEN={"url","autori","lingua","settore","pdfurl"};
String[] fieldsITA={"titolo","tipologia","abstract","pdf"};
String[] fieldsENG={"titolo_en","tipologia_en", "abstract_en","pdf_en"};
MultiFieldQueryParser parserGEN = new MultiFieldQueryParser(version, fieldsGEN, analyzerIT);
MultiFieldQueryParser parserITA = new MultiFieldQueryParser(version, fieldsITA, analyzerIT);
MultiFieldQueryParser parserENG = new MultiFieldQueryParser(version, fieldsENG, analyzerENG);
parserGEN.setDefaultOperator(QueryParser.Operator.OR);
parserITA.setDefaultOperator(QueryParser.Operator.OR);
parserENG.setDefaultOperator(QueryParser.Operator.OR);
Query query4 =parserGEN.parse(ricerca.ricerca);
bq.add(query4, Occur.SHOULD);
Query query2 =parserITA.parse(ricerca.ricerca);
bq.add(query2, Occur.SHOULD);
Query query3 =parserENG.parse(ricerca.ricerca);
bq.add(query3, Occur.SHOULD);
If I search for "anna" (the name of an author), the 3 queries are:
Query: [titolo:anna tipologia:anna abstract:anna pdf:anna]
Query: [titolo_en:anna tipologia_en:anna abstract_en:anna pdf_en:anna]
Query: [url:anna autori:anna lingua:anna settore:anna pdfurl:anna]
and I also get documents by authors without the name anna, even though they rank last (about 3 of the 21 results, out of 1000 indexed documents). I suppose the term is found in other fields.
Do you think the query is well constructed? Can it be improved, and how? How does a search engine like Google handle multi-field search?
Is there a better way to deal with multi-language fields?
Thanks, Neptune.
Upvotes: 0
Views: 372
Reputation: 5693
Unless you have both translations for all documents, I would create 2 indexes -- 1 for each language, using the same field names for each index. You would then use a MultiReader with the search queries.
The problem with this approach is words that are spelled the same in both languages but have different meanings in English and Italian. Apart from those words, I think this architecture will be easier to understand, and its results easier to interpret.
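To make the idea concrete, here is a toy sketch in plain Java. It does not use the Lucene API: two in-memory lists stand in for the two per-language indexes, and the hypothetical searchAll method plays the role a MultiReader would play. The class name, the sample documents, and the naive whole-word matching are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model only: plain collections stand in for two Lucene indexes.
class TwoIndexSketch {

    // One "index" per language. Both use the SAME field names (titolo, autori).
    static final List<Map<String, String>> ITALIAN_INDEX = new ArrayList<>();
    static final List<Map<String, String>> ENGLISH_INDEX = new ArrayList<>();

    static {
        // Made-up sample documents, not taken from the question.
        ITALIAN_INDEX.add(doc("Prenotare una camera d'albergo", "Mario Rossi"));
        ITALIAN_INDEX.add(doc("Storia di Anna Bolena", "Luca Bianchi"));
        ENGLISH_INDEX.add(doc("Digital camera calibration", "Anna Verdi"));
    }

    // Builds a document as a map from field name to field text.
    static Map<String, String> doc(String titolo, String autori) {
        Map<String, String> d = new LinkedHashMap<>();
        d.put("titolo", titolo);
        d.put("autori", autori);
        return d;
    }

    // Naive stand-in for an analyzer: lowercase, split on non-word characters.
    static boolean containsTerm(String text, String term) {
        List<String> tokens = Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+"));
        return tokens.contains(term.toLowerCase(Locale.ROOT));
    }

    // Returns the titles of documents in ONE index that match the term in any given field.
    static List<String> search(List<Map<String, String>> index, String term, String... fields) {
        List<String> titles = new ArrayList<>();
        for (Map<String, String> d : index) {
            for (String field : fields) {
                String text = d.get(field);
                if (text != null && containsTerm(text, term)) {
                    titles.add(d.get("titolo"));
                    break; // one matching field is enough for this document
                }
            }
        }
        return titles;
    }

    // Runs the same field names against both indexes and concatenates the hits.
    static List<String> searchAll(String term, String... fields) {
        List<String> titles = new ArrayList<>(search(ITALIAN_INDEX, term, fields));
        titles.addAll(search(ENGLISH_INDEX, term, fields));
        return titles;
    }

    public static void main(String[] args) {
        System.out.println(searchAll("anna", "autori"));           // author field only
        System.out.println(searchAll("anna", "titolo", "autori")); // also matches a title
        System.out.println(searchAll("camera", "titolo"));         // same spelling, two meanings
    }
}
```

In this toy data, searching "anna" in autori alone returns only the document written by an Anna, while adding titolo also pulls in a document that merely mentions Anna in its title, which is the effect described in the question. The "camera" search returns one hit from each index even though the word means "room" in Italian and "camera" in English, which is the caveat mentioned above.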
Upvotes: 1