Reputation: 959

Elasticsearch: sorting a field containing special characters, numbers and letters

I created a case-insensitive analyzer as follows:

PUT /dhruv3
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_keyword": {
            "tokenizer": "keyword",
            "filter": [ "lowercase", "asciifolding" ]
          }
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "about": {
          "type": "string",
          "analyzer": "analyzer_keyword"
        },
        "firsName": {
          "type": "string"
        }
      }
    }
  }
}

and used it in the mapping. The about field is supposed to contain alphanumeric and special characters. Then I inserted some documents with the following values in the about field:

1234, `pal, pal, ~pal

Besides searching, I need the results to be sorted. Searching works well, but when I try to sort them with

GET dhruv/test/_search
{
  "sort": [
    {
      "about": {
        "order": "asc"
      }
    }
  ]
}

I get the results ordered by the about field as

1234, `pal, pal, ~pal

But I expect special characters to come first, then numbers, and then letters.

I did some homework and learned that this is because of their ASCII values. So I searched the internet and even tried asciifolding, but it didn't work. I know there is a solution somewhere, but I can't figure it out. Please guide me.

Upvotes: 0

Answers (2)

asyncwait

Reputation: 4537

The asciifolding filter has nothing to do with what you're trying to achieve. The ASCIIFoldingFilter.java source has a wealth of information: it merely converts Unicode characters such as \uFF5E to their ASCII equivalents, when such an equivalent exists.

Adding to @Val's answer: if you want the values sorted with special characters first, then numbers, then letters, you may want to consider using:

GET /ascii/test/_search
{
  "sort": {
    "_script": {
      "script": "r = doc['about'].value.chars[0]; return !r.isLetter() ? r.isDigit() ? 1 : -1 : 2",
      "type": "number",
      "order": "asc"
    }
  }
}

Also note that this sorting may not be perfect, since the script only looks at the first character. You may want to write a more robust script that takes the entire value into account.
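To illustrate what "taking care of the entire value" could mean, here is a minimal Python sketch of the ordering logic only, not an Elasticsearch script. The names char_rank and sort_key are hypothetical, and the extra values p1 and p~ are made up for the example. Each character is mapped to a (rank, character) pair, so the comparison continues past the first character.

```python
def char_rank(ch: str) -> int:
    """Rank a character: 0 = special, 1 = digit, 2 = letter."""
    if ch.isalpha():
        return 2
    if ch.isdigit():
        return 1
    return 0


def sort_key(value: str) -> list:
    # One (rank, character) pair per character, so the whole value is compared.
    # Lowercasing mirrors the lowercase filter used by the analyzer.
    return [(char_rank(ch), ch) for ch in value.lower()]


values = ["1234", "`pal", "pal", "~pal", "p1", "p~"]
ordered = sorted(values, key=sort_key)
print(ordered)  # ['`pal', '~pal', '1234', 'p~', 'p1', 'pal']
```

The same idea would have to be ported to whatever scripting language your cluster allows, or precomputed at index time into a separate sort field.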

This gist is a good example of what you can achieve using embedded scripts.

Upvotes: 1

Val

Reputation: 217594

You're right that the sorting behavior you are seeing is due to the ASCII values of these special characters being greater than the ASCII values of digits. To be precise, looking at the ASCII table, we have the following values:

1 has the ASCII value 49
` has the ASCII value 96
p has the ASCII value 112
~ has the ASCII value 126
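As a quick check of these values, the following Python snippet (an illustration outside Elasticsearch) prints the code of each leading character and shows that a plain lexicographic sort reproduces the order from the question.

```python
values = ["1234", "`pal", "pal", "~pal"]

# Code of the first character of each value.
first_char_codes = [ord(v[0]) for v in values]
print(first_char_codes)  # [49, 96, 112, 126]

# A plain string sort compares character codes, just like the keyword-analyzed field.
default_order = sorted(values)
print(default_order)  # ['1234', '`pal', 'pal', '~pal']
```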

The asciifolding token filter simply transforms characters and digits which are NOT in the ASCII table (i.e. the first 128 characters) into their ASCII equivalent, if one exists (e.g. é, è, ë, ê are transformed to e). Since all the characters above are already in the ASCII table, this is not what you're looking for.
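A rough way to see this outside Elasticsearch is the Python sketch below. It only approximates the idea of ASCII folding with Unicode NFKD normalization; it is not Lucene's actual mapping table, and fold_to_ascii is a hypothetical helper name. Non-ASCII input gets folded, while the values from the question pass through unchanged, so their sort order cannot change.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    # Decompose characters (e.g. "é" -> "e" + combining accent), then drop
    # everything that is still outside ASCII. This is an approximation only.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(fold_to_ascii("éèëê"))       # eeee
print(fold_to_ascii("\uFF5Epal"))  # ~pal (fullwidth tilde folded to "~")

values = ["1234", "`pal", "pal", "~pal"]
folded = [fold_to_ascii(v) for v in values]
print(folded == values)  # True: already ASCII, so nothing changes
```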

If you want the special characters to come first in the search results, there are several ways to do it.

One way is to use script sorting and give values that start with a special character a negative sort key, so that they always come before values that start with a letter or a digit:

{
  "sort": [
    {
      "_script": {
        "script": "return doc['about'].value.chars[0].isLetterOrDigit() ? 1 : -1",
        "type": "number",
        "order": "asc"
      }
    }
  ]
}
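Note that this script only produces two buckets (-1 and 1), so digits and letters share a bucket and ties need a second criterion. The Python sketch below mimics that logic under two assumptions that are not part of the answer: str.isalnum stands in for isLetterOrDigit, and a second sort clause on about is used as a tie-breaker.

```python
def script_key(value: str) -> int:
    # Mirrors the script above: 1 for a leading letter or digit, -1 otherwise.
    return 1 if value[0].isalnum() else -1


values = ["pal", "1234", "~pal", "`pal"]

# Bucket first, then the raw value as a tie-breaker (assumed second sort clause).
with_tiebreak = sorted(values, key=lambda v: (script_key(v), v))
print(with_tiebreak)  # ['`pal', '~pal', '1234', 'pal']
```

Without the tie-breaker, the relative order inside each bucket is not determined by the script itself.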

Upvotes: 4
