David542

Reputation: 110083

elasticsearch lowercase like %term% search

I have the following filepath I need to save to ES:

/mnt/qfs-X/Asset_Management/XG_Marketing_/Episodic-SG_1001_1233.jpg

I would like to be able to search the following and get a match:

search = "qf episodic sg_1001 JPG"

In other words, it would be equivalent to the following search in (My)SQL:

select * from table where fp like '%qf%' and fp like '%episodic%' 
and fp like '%sg_1001%' and fp like '%jpg%'

Two questions here:

  1. What would be the proper way to store this in my index? Currently I have the very basic (and incorrect) keyword field:

    body = {
            "mappings": {
                "_doc": {
                    "dynamic": "strict",
                    "properties": {
                        "path":        {"type": "keyword"},
                    }
                }
            }
    }
    

  2. What would be the correct way to search the above in ES? Currently I have:

    "query": {
      "bool": {
        "must": [
          { "match": { "fp": "qf" } },
          { "match": { "fp": "episodic" } },
          { "match": { "fp": "sg_1001" } },
          { "match": { "fp": "JPG" } }
        ]
      }
    }
    

Upvotes: 0

Views: 91

Answers (1)

Kamal Kunjapur

Reputation: 8840

Let's say your input is this:

/mnt/qfs-X/Asset_Management/XG_Marketing_/Episodic-SG_1001_1233.jpg

What I am going to do is convert every forward slash and underscore into whitespace.

So effectively your input would now look like this:

mnt qfs-X Asset Management XG Marketing Episodic-SG 1001 1233.jpg

Using the standard tokenizer along with the token filters (standard and lowercase), below is the list of words that would end up stored in your inverted index and could be queried:

mnt qfs x asset management xg marketing episodic sg 1001 1233 jpg
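
To make the analysis chain above concrete, here is a rough Python sketch that imitates it outside Elasticsearch. The function name `approximate_analyze` is hypothetical, and the regex split is only an approximation of the standard tokenizer (which follows Unicode word-boundary rules), so treat it as an illustration rather than what Elasticsearch actually runs:

```python
import re


def approximate_analyze(text):
    """Imitate char_filter -> tokenizer -> lowercase (an approximation only)."""
    # char_filter step: replace every "/" and "_" with a space
    replaced = re.sub(r"/|_", " ", text)
    # rough stand-in for the standard tokenizer: keep runs of ASCII letters/digits
    tokens = re.findall(r"[A-Za-z0-9]+", replaced)
    # lowercase token filter
    return [token.lower() for token in tokens]


path = "/mnt/qfs-X/Asset_Management/XG_Marketing_/Episodic-SG_1001_1233.jpg"
tokens = approximate_analyze(path)
print(" ".join(tokens))
```

Running the same function over a query string such as "QFS EPISODIC SG 1001 jpg" gives lowercase tokens too, which is roughly why case does not matter at search time.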

Below is the sample mapping and query for the above:

Mapping

PUT mysampleindex
{  
   "settings":{  
      "analysis":{  
         "analyzer":{  
            "my_analyzer":{  
               "tokenizer":"standard",
               "char_filter":[  
                  "my_char_filter"
               ],
               "filter":[  
                  "standard",
                  "lowercase"
               ]
            }
         },
         "char_filter":{  
            "my_char_filter":{  
               "type":"pattern_replace",
               "pattern":"\\/|_",
               "replacement":" "
            }
         }
      }
   },
   "mappings":{  
      "mydocs":{  
         "properties":{  
            "mytext":{  
               "type":"text",
               "analyzer":"my_analyzer"
            }
         }
      }
   }
}

Sample Document

POST mysampleindex/mydocs/1
{
  "mytext": "/mnt/qfs-X/Asset_Management/XG_Marketing_/Episodic-SG_1001_1233.jpg"
}

Sample Query

POST mysampleindex/_search
{  
   "query":{  
      "match":{  
         "mytext":"qfs episodic sg 1001 jpg"
      }
   }
}
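
One thing to be aware of: a match query combines its terms with OR by default, whereas the SQL in the question requires every term to be present. Below is a minimal sketch, written in Python to match the `body = {...}` style of the question, of a request body that asks for all terms through the match query's `operator` option. The helper name `build_all_terms_query` is made up for illustration, and the field name `mytext` is the one from the sample mapping:

```python
def build_all_terms_query(field, search):
    """Build a match query body that requires every analyzed term to match."""
    return {
        "query": {
            "match": {
                field: {
                    "query": search,
                    # the default operator is "or"; "and" requires all terms
                    "operator": "and",
                }
            }
        }
    }


body = build_all_terms_query("mytext", "qfs episodic sg 1001 jpg")
```

The resulting dict could then be passed as the search body by whatever client you already use.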

Keep in mind that when you send the above query to Elasticsearch, it applies search-time analysis to the query string as well. I'd suggest reading this link for more information on this; it is the reason why you would still get the document even with the query string below.

"mytext": "QFS EPISODIC SG 1001 jpg"

Now if you try to search using a partial term such as pisodic (a fragment of episodic), as in the query below, the search won't return anything, because your inverted index doesn't store the words in that form. For such scenarios I'd suggest making use of the N-Gram Tokenizer, so that episodic is further broken into fragments such as episodi and pisodic, which are then stored in the inverted index.

POST mysampleindex/_search
{  
   "query":{  
      "match":{  
         "mytext":"pisodic"
      }
   }
}
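
To show what n-grams would add to the index, here is a small Python sketch that lists the character n-grams of a single token. The function name `char_ngrams` and the sizes 3 and 8 are arbitrary choices for illustration; in Elasticsearch the equivalent would be an ngram tokenizer configured with `min_gram` and `max_gram` in the index settings:

```python
def char_ngrams(term, min_gram, max_gram):
    """Return every substring of term with length between min_gram and max_gram."""
    grams = []
    for size in range(min_gram, max_gram + 1):
        # slide a window of the current size across the term
        for start in range(len(term) - size + 1):
            grams.append(term[start:start + size])
    return grams


grams = char_ngrams("episodic", 3, 8)
print("pisodic" in grams, "episodi" in grams)
```

With fragments like these stored, a search for pisodic has something to match against, at the cost of a larger index.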

Also note that I have been making use of text and not keyword datatype. I hope this helps!

Upvotes: 1
