Reputation: 1739
Our goal
We would like to give our users the ability to get search suggestions as they start typing, but the Elasticsearch suggesters don't offer anything that seems to fit our use case of getting suggestions for snippets of text from articles. N-gramming and searching the titles of the documents is fine for indices with many widely varying titles, but for a small number of articles the titles just don't carry enough information, and lots of search phrases return zero results. We also cannot have the users tag all the documents with relevant suggestion clues.
Our documents typically consist of a title and a description (body) plus various other properties like groups, categories and departments.
Our current solution: shingles in a separate index
Every time we index a document, we call Elasticsearch's _analyze endpoint to generate the shingles (2-5) for the description + title of the document. Each result (shingles produce a huge number of results) is then stored as a field called Suggestion in a copy of the original document in a new index. We copy the whole document because some users might want to narrow down suggestions to documents that belong to a certain category, or apply any other arbitrary filter that we give them the option to supply.
Original document (Main index):
{
"Title": "A fabulous document",
"Description": "A document with fabulous content",
"Category": "A"
}
Suggestion documents (Suggestion index)
(Suggestion 1)
{
"Title": "A fabulous document",
"Description": "A document with fabulous content",
"Category": "A",
"Suggestion": "A"
}
(Suggestion 2)
{
"Title": "A fabulous document",
"Description": "A document with fabulous content",
"Category": "A",
"Suggestion": "A document"
}
...
(Suggestion N)
{
"Title": "A fabulous document",
"Description": "A document with fabulous content",
"Category": "A",
"Suggestion": "a document with"
}
But as you can see, for an article of 1000 words, we could easily get hundreds or thousands of shingles, each duplicating the entire main document.
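To put a number on that, here is a minimal Python sketch that imitates what a shingle filter emits (sizes 2-5 plus single words, which the "A" suggestion above implies). The `shingles` function and the whitespace tokenization are illustrative assumptions, not the actual output of the _analyze endpoint, which also depends on the configured tokenizer and filters.

```python
def shingles(tokens, min_size=2, max_size=5, output_unigrams=True):
    """Return word n-grams roughly the way a shingle token filter emits them."""
    sizes = list(range(min_size, max_size + 1))
    if output_unigrams and min_size > 1:
        sizes = [1] + sizes  # single words are emitted alongside the shingles
    result = []
    for start in range(len(tokens)):
        for size in sizes:
            if start + size <= len(tokens):
                result.append(" ".join(tokens[start:start + size]))
    return result


# Naive whitespace tokenization (assumption; a real analyzer does more).
tokens = "a document with fabulous content".split()
print(shingles(tokens)[:3])  # ['a', 'a document', 'a document with']

# Upper bound for a 1000-token article, before de-duplication:
# 1000 + 999 + 998 + 997 + 996 = 4990 suggestion documents.
article = [f"w{i}" for i in range(1000)]
print(len(shingles(article)))  # 4990
```

Repeated phrases collapse when de-duplicated, so the real count is lower, but every remaining shingle still carries a full copy of the main document.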
To search, we run a prefix query against the suggestion documents together with a terms aggregation to get the word combinations that appear most frequently. Our users actually kind of like this solution, as long as they don't have anything better.
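The request for that search could be built roughly as follows. This is a sketch under assumptions: `build_suggestion_query` is a hypothetical helper, the Suggestion and Category fields are assumed to be indexed as exact (not analyzed) values, and the `bool` query with a `filter` clause assumes an Elasticsearch version that supports it.

```python
import json


def build_suggestion_query(prefix, filters=None, size=10):
    """Build a request body: prefix query on Suggestion plus a terms aggregation."""
    # Optional exact-match filters, e.g. {"Category": "A"} (assumed field names).
    filter_clauses = [
        {"term": {field: value}} for field, value in (filters or {}).items()
    ]
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {
            "bool": {
                "must": [{"prefix": {"Suggestion": {"value": prefix}}}],
                "filter": filter_clauses,
            }
        },
        "aggs": {
            "suggestions": {
                # Most frequent shingles among the matching suggestion documents.
                "terms": {"field": "Suggestion", "size": size}
            }
        },
    }


body = build_suggestion_query("fabulo", filters={"Category": "A"})
print(json.dumps(body, indent=2))
```

Because the query already restricts the documents to those whose Suggestion starts with the prefix, the aggregation only has to count matching shingles rather than every term in the index.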
Another simpler, but too slow solution
We have tried to just analyze a copy_to field (autocomplete) with a shingles analyzer, and then do a terms aggregation with a regex include-filter to remove the terms that don't start with the search phrase, but that is just way too slow and memory hungry, as the number of irrelevant terms (to a specific query) for each field is just too great.
Search: "fabulo"
{
"size": 0,
"aggs": {
"autocomplete": {
"terms": {
"field": "autocomplete",
"include": {
"pattern": "fabulo(.*)"
}
}
}
},
"query": {
"prefix": {
"autocomplete": {
"value": "fabulo"
}
}
}
}
Basing suggestions on previous searches
We are working on basing suggestions on previous search phrases, but a new user will need to have some autocomplete suggestions based on content as well, if they have very few user-generated searches.
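One way the two sources might be combined is sketched below: previous searches come first, and content-based suggestions fill whatever slots remain. The `merge_suggestions` function and its inputs are hypothetical; this is an illustration of the fallback idea, not something we have in production.

```python
def merge_suggestions(prefix, past_searches, content_suggestions, limit=5):
    """Prefer previous search phrases, then fall back to content-based ones."""
    prefix = prefix.lower()
    merged = []
    # Past searches are listed first so they win ties and keep their order.
    for candidate in list(past_searches) + list(content_suggestions):
        text = candidate.lower()
        if text.startswith(prefix) and text not in merged:
            merged.append(text)
        if len(merged) == limit:
            break
    return merged


result = merge_suggestions(
    "fabulo",
    past_searches=["fabulous holiday"],
    content_suggestions=["fabulous content", "fabulous document", "fabulous holiday"],
    limit=3,
)
print(result)  # ['fabulous holiday', 'fabulous content', 'fabulous document']
```

A new user with no search history simply gets the content-based list.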
Question:
Is there any way to do this faster, simpler, better?
The Elasticsearch suggesters all seem to require you to know the suggestions in advance or to have descriptive titles. That seems very good for product suggestions, but not for suggestions drawn from large text content. Plus, we have the filtering issue to take into account.
Upvotes: 5
Views: 1063
Reputation: 71
We're using a combination of shingles and aggregation in a dedicated index:
"type": "shingle",
"max_shingle_size": 3,
"min_shingle_size": 1
},
Upvotes: 0