Pentium10

Reputation: 207952

Testing an Elasticsearch custom analyzer - pipe-delimited keywords

I have this index with a custom analyzer named pipe. When I try to test it, it returns every character instead of pipe-delimited words.

I am building this for a use case where each input line of keywords looks like crockpot refried beans|corningware replacement|crockpot lids|recipe refried beans, and Elasticsearch should return matches after the line has been split on the pipes.

{
  "keywords": {
    "aliases": {

    },
    "mappings": {
      "cloud": {
        "properties": {
          "keywords": {
            "type": "text",
            "analyzer": "pipe"
          }
        }
      }
    },
    "settings": {
      "index": {
        "number_of_shards": "5",
        "provided_name": "keywords",
        "creation_date": "1513890909384",
        "analysis": {
          "analyzer": {
            "pipe": {
              "type": "custom",
              "tokenizer": "pipe"
            }
          },
          "tokenizer": {
            "pipe": {
              "pattern": "|",
              "type": "pattern"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "DOLV_FBbSC2CBU4p7oT3yw",
        "version": {
          "created": "6000099"
        }
      }
    }
  }
}
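
For context, this is the kind of query I eventually want to serve once the keywords are tokenized on the pipes. The query below is only a hypothetical illustration, reusing one of the phrases from the example input above:

curl -XPOST 'http://localhost:9200/keywords/_search' -H 'Content-Type: application/json' -d '{
  "query": {
    "match": {
      "keywords": "crockpot lids"
    }
  }
}'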

When I test it following this guide:

curl -XPOST 'http://localhost:9200/keywords/_analyze' -H 'Content-Type: application/json' -d '{
 "analyzer": "pipe",
 "text": "pipe|pipe2"
}'

I get back character-by-character results:

{
  "tokens": [
    {
      "token": "p",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "i",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "p",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 2
    },
    {
      "token": "e",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 3
    },

Upvotes: 0

Views: 420

Answers (1)

Val

Reputation: 217354

Good work, you're almost there. The pipe character | is a special character in regular expressions (the alternation operator), so an unescaped | matches the empty string at every position, which is why the pattern tokenizer emits one token per character. You need to escape it like this:

      "tokenizer": {
        "pipe": {
          "pattern": "\\|",   <--- change this
          "type": "pattern"
        }
      }

And then your analyzer will work and produce this:

{
  "tokens": [
    {
      "token": "pipe",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "pipe2",
      "start_offset": 5,
      "end_offset": 10,
      "type": "word",
      "position": 1
    }
  ]
}
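
Two practical notes. Analysis settings are static, so you cannot change the tokenizer on an open index: you have to close the index, update the settings, and reopen it (and existing documents keep their old tokens until you reindex them). A sketch of that sequence, assuming the keywords index from the question:

curl -XPOST 'http://localhost:9200/keywords/_close'
curl -XPUT 'http://localhost:9200/keywords/_settings' -H 'Content-Type: application/json' -d '{
  "analysis": {
    "tokenizer": {
      "pipe": {
        "pattern": "\\|",
        "type": "pattern"
      }
    }
  }
}'
curl -XPOST 'http://localhost:9200/keywords/_open'

You can also try the escaped pattern without touching any index at all: the _analyze API accepts an inline tokenizer definition, so something like this should produce the two expected tokens:

curl -XPOST 'http://localhost:9200/_analyze' -H 'Content-Type: application/json' -d '{
  "tokenizer": {
    "type": "pattern",
    "pattern": "\\|"
  },
  "text": "pipe|pipe2"
}'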

Upvotes: 1
