Rasmus

Reputation: 2933

Find concatenated words in Elasticsearch

Say I have indexed this data

song:{
  title:"laser game"
}

but the user is searching for

lasergame

How would you go about mapping/indexing/querying for this?

Upvotes: 5

Views: 1427

Answers (2)

Evaldas Buinauskas

Reputation: 14077

The easiest solution would be to use nGrams. That gives you a base to start from, which you can tweak to meet your needs. Here you go:

Mappings

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "myAnalyzer": {
          "type": "custom",
          "tokenizer": "nGram",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "sample": {
      "properties": {
        "myField": {
          "type": "string",
          "analyzer": "myAnalyzer"
        }
      }
    }
  }
}

Test document

PUT /test/sample/1
{
  "myField": "laser game"
}

Query

GET /test/_search
{
  "query": {
    "match": {
      "myField": "lasergame"
    }
  }
}

Results

{
  "took": 47,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2161999,
    "hits": [
      {
        "_index": "test",
        "_type": "sample",
        "_id": "1",
        "_score": 0.2161999,
        "_source": {
          "myField": "laser game"
        }
      }
    ]
  }
}

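This analyzer will create lots of ngrams in your index. With the tokenizer's defaults (min_gram 1, max_gram 2) these are short grams such as l, la, a, as; raising max_gram adds longer ones such as las, lase ... gam, game, etc. Both lasergame and laser game will produce a lot of similar tokens, so the query will find your document as you'd expect.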
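The gram lengths are controlled by the tokenizer's min_gram and max_gram settings. As a rough sketch, the settings part of PUT /test could define the tokenizer explicitly instead of using the built-in one. The name myNGramTokenizer and the values 2 and 4 are assumptions for illustration, not tuned recommendations, and the type name follows the older nGram spelling used in this answer (newer versions may spell it ngram and limit the gap between the two values).

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "myNGramTokenizer": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 4
        }
      },
      "analyzer": {
        "myAnalyzer": {
          "type": "custom",
          "tokenizer": "myNGramTokenizer",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        }
      }
    }
  }
}
```

Longer grams make matches more selective, while a wider range produces more tokens per document and a larger index.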

Upvotes: 2

ChintanShah25

Reputation: 12672

This is kind of a tricky problem.

1) I guess the most effective way might be to use a compound word token filter, with a word list made up of words you think users might concatenate.

"settings": {
    "analysis": {
      "analyzer": {
        "concatenate_split": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "myFilter"
          ]
        }
      },
      "filter": {
        "myFilter": {
          "type": "dictionary_decompounder",
          "word_list": [
            "laser",
            "game",
            "lean",
            "on",
            "die",
            "hard"
          ]
        }
      }
    }
  }

After applying the analyzer, lasergame will be split into laser and game, with lasergame kept as well, so this will give you results that contain any of those words.
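The settings only define the analyzer; it still has to be attached to a field. A minimal sketch of the mappings part of the index creation request, assuming the song type and title field from the question and the older string mapping syntax used in this thread (newer versions use text and drop mapping types):

```json
{
  "mappings": {
    "song": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "concatenate_split"
        }
      }
    }
  }
}
```

With the analyzer set this way it is applied both when indexing the title and when analyzing the text of a match query, so a search for lasergame is decompounded before it is compared with the indexed laser and game tokens.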

2) Another approach could be concatenating the whole title with a pattern replace char filter that removes all the spaces.

{
    "index" : {
        "analysis" : {
            "char_filter" : {
                "my_pattern":{
                    "type":"pattern_replace",
                    "pattern":"\\s+",
                    "replacement":""
                }
            },
            "analyzer" : {
                "custom_with_char_filter" : {
                    "tokenizer" : "standard",
                    "char_filter" : ["my_pattern"]
                }
            }
        }
    }
}

You need to use multi fields with this approach. With this pattern, laser game will be indexed as lasergame and your query will work. The problem is that laser game play will be indexed as lasergameplay, so a search for lasergame won't return anything; you might want to consider using a prefix query or wildcard query for this.
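As a sketch of the multi-field idea, the title can keep its normal analysis while a sub-field holds the space-stripped version. The sub-field name concatenated is an assumption for illustration; the song type and title field come from the question, and the string type follows the older syntax used in this thread.

```json
{
  "mappings": {
    "song": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "concatenated": {
              "type": "string",
              "analyzer": "custom_with_char_filter"
            }
          }
        }
      }
    }
  }
}
```

A prefix query against that sub-field would then also match longer titles such as laser game play, indexed as lasergameplay:

```json
{
  "query": {
    "prefix": {
      "title.concatenated": "lasergame"
    }
  }
}
```

Prefix queries are not analyzed, and the analyzer above has no lowercase filter, so the search text has to match the indexed token's casing exactly.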

3) This might not make sense, but you could also use a synonym filter if you think users often concatenate certain words.
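A minimal sketch of what that could look like, assuming hypothetical names concat_synonyms and concat_synonym_analyzer and a hand-maintained list of the concatenations you expect (the pairs below are only examples):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "concat_synonyms": {
          "type": "synonym",
          "synonyms": [
            "lasergame => laser game",
            "diehard => die hard"
          ]
        }
      },
      "analyzer": {
        "concat_synonym_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "concat_synonyms"
          ]
        }
      }
    }
  }
}
```

The obvious drawback is that every concatenation has to be listed by hand, so this only fits a small, known set of terms.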

Hope this helps!

Upvotes: 4
