Raghav salotra

Reputation: 871

Regular expression in Elasticsearch

What regular expression pattern should a tokenizer in Elasticsearch use so that C# and C++ each match separately?

Right now we have one analyzer for this, but whenever we search for C#, it shows C++ as a match too, and vice versa.

Upvotes: 0

Views: 642

Answers (1)

Sloan Ahrens

Reputation: 8718

Assuming I understand you correctly, one thing you can do is set up an analyzer that just tokenizes on whitespace. The default standard analyzer tokenizes on symbols as well as whitespace, so "c++" and "c#" both get turned into the term "c", and both documents will match a search for either one.

One way around this (though it might cause you other headaches) is to use an analyzer like this:

"whitespace_analyzer": {
   "type": "custom",
   "tokenizer": "whitespace",
   "filter": [
      "lowercase",
      "asciifolding"
   ]
}
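As a quick sanity check, you can compare the two behaviors with the _analyze API. This is a sketch assuming the query-parameter syntax of the 1.x-era Elasticsearch this answer targets (note the URL-encoded + character):

GET /_analyze?analyzer=standard&text=C%2B%2B
# standard analyzer strips the symbols: single token "c"

GET /_analyze?analyzer=whitespace&text=C%2B%2B
# built-in whitespace analyzer keeps the token whole: "C++"
# (unlike the custom analyzer above, it does not lowercase)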

Or, in a full toy example, I can set up an index like:

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "analysis": {
         "analyzer": {
            "whitespace_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding"
               ]
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "text_field": {
               "type": "string",
               "analyzer": "whitespace_analyzer"
            }
         }
      }
   }
}
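Before indexing anything, it's worth confirming that the custom analyzer behaves as intended by pointing _analyze at the new index (same assumed 1.x syntax as above):

GET /test_index/_analyze?analyzer=whitespace_analyzer&text=C%2B%2B
# single token "c++": lowercased, symbols intact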

Then add a few docs via the bulk API:

POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc", "_id":1}}
{"text_field": "some text with C++"}
{"index":{"_index":"test_index","_type":"doc", "_id":2}}
{"text_field": "some text with C#"}
{"index":{"_index":"test_index","_type":"doc", "_id":3}}
{"text_field": "some text with Objective-C"}

Now a search for "C++" only gives me back the document that contains that term:

POST /test_index/_search
{
    "query": {
        "match": {
           "text_field": "C++"
        }
    }
}
...
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.70273256,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.70273256,
            "_source": {
               "text_field": "some text with C++"
            }
         }
      ]
   }
}

and likewise with "C#":

POST /test_index/_search
{
    "query": {
        "match": {
           "text_field": "C#"
        }
    }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.70273256,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "2",
            "_score": 0.70273256,
            "_source": {
               "text_field": "some text with C#"
            }
         }
      ]
   }
}
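This works because the match query runs the query text through the field's analyzer, so "C#" is searched as the term "c#". As an illustration, an equivalent lower-level request is a term query against the already-analyzed term (note the lowercase):

POST /test_index/_search
{
    "query": {
        "term": {
           "text_field": "c#"
        }
    }
}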

This solution may or may not end up giving you what you want, because it won't tokenize on punctuation either: "C++." at the end of a sentence, for example, would be indexed as the single token "c++." and would not match a search for "C++".

Here is the code I used:

http://sense.qbox.io/gist/92871671ea7313356cbbd1ea900c3d55944bd20b

EDIT: Here is a slightly more advanced solution that can help solve the punctuation problem. I got the idea from this article. The basic idea is that you can tell the word_delimiter token filter to treat certain symbol characters as alpha-numeric characters.

So I create the index using a custom token filter, then add the same three docs plus another one that the previous solution would not handle correctly:

DELETE /test_index

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "analysis": {
         "filter": {
            "symbol_filter": {
               "type": "word_delimiter",
               "type_table": [
                  "# => ALPHANUM",
                  "+ => ALPHANUM",
                  "@ => ALPHANUM"
               ]
            }
         },
         "analyzer": {
            "whitespace_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding",
                  "symbol_filter"
               ]
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "text_field": {
               "type": "string",
               "analyzer": "whitespace_analyzer"
            }
         }
      }
   }
}

POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc", "_id":1}}
{"text_field": "some text with C++"}
{"index":{"_index":"test_index","_type":"doc", "_id":2}}
{"text_field": "some text with C#"}
{"index":{"_index":"test_index","_type":"doc", "_id":3}}
{"text_field": "some text with Objective-C"}
{"index":{"_index":"test_index","_type":"doc", "_id":4}}
{"text_field": "some text with Objective-C, C#, and C++."}

Now querying for "C++" will return both the documents that contain that token:

POST /test_index/_search
{
    "query": {
        "match": {
           "text_field": "C++"
        }
    }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.643841,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.643841,
            "_source": {
               "text_field": "some text with C++"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "4",
            "_score": 0.40240064,
            "_source": {
               "text_field": "some text with Objective-C, C#, and C++."
            }
         }
      ]
   }
}
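The parallel search for "C#" should likewise return documents 2 and 4, since both now contain the token "c#":

POST /test_index/_search
{
    "query": {
        "match": {
           "text_field": "C#"
        }
    }
}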

Here is the code for this one:

http://sense.qbox.io/gist/5c583b4e99b8f3b088925ccdb894695aa0c257cb

Upvotes: 1
