eran

Reputation: 6921

Analyzing text with leading hyphens using the _analyze endpoint

I'm playing with Elasticsearch, analyzing some text with the _analyze endpoint.

When the text has leading hyphens, the response is not JSON but some other format.

I tried searching the docs for this but found nothing. Can someone point me to the reason? Is there a way to force JSON output? Examples below.

Thanks.


Two short examples: one text is 'this is', the other is '---------this is'.

This works fine:

% curl -XGET 'localhost:9200/_analyze?analyzer=standard' -d 'this is'

{"tokens":[{"token":"this","start_offset":0,"end_offset":4,"type":"<ALPHANUM>","position":1},{"token":"is","start_offset":5,"end_offset":7,"type":"<ALPHANUM>","position":2}]}

But with leading hyphens it returns a different format:

% curl -XGET 'localhost:9200/_analyze?analyzer=standard' -d '---------this is'

---
tokens:
- token: "this"
  start_offset: 9
  end_offset: 13
  type: "<ALPHANUM>"
  position: 1
- token: "is"
  start_offset: 14
  end_offset: 16
  type: "<ALPHANUM>"
  position: 2

Upvotes: 0

Views: 78

Answers (2)

Val

Reputation: 217314

When the documentation doesn't say, turn to the ultimate "documentation" resource: the code.

Starting from RestAnalyzeAction, the REST handler for the _analyze call, we can see that on line 87 it tries to guess the content type of the request body by calling RestActions.guessBodyContentType. That method in turn calls XContentFactory.xContentType, and there we find the reason on line 156: if the body starts with two hyphens, the request is interpreted as YAML and the response is formatted as YAML accordingly.
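
As a rough illustration of that chain, here is a minimal Python sketch of prefix-based sniffing. It is not the Elasticsearch source: the function names are hypothetical, and the two-hyphen prefix simply mirrors the description above, so the exact condition in the real code may differ.

```python
from typing import Optional

# Assumption: the YAML check is modelled on the description above
# ("the body starts with two hyphens"); the real condition may differ.
YAML_PREFIX = "--"


def guess_body_content_type(body: str) -> Optional[str]:
    """Hypothetical stand-in for the sniffing step: guess a format from the body prefix."""
    if body.startswith(YAML_PREFIX):
        return "yaml"
    return None  # nothing recognised


def response_format(body: str) -> str:
    """Use the sniffed type for the response, falling back to JSON."""
    return guess_body_content_type(body) or "json"


print(response_format("this is"))           # json
print(response_format("---------this is"))  # yaml
```

The point of the sketch is only that the body text itself, not any header you sent, decides the response format.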

You can confirm this fact by adding the -v (for verbose) switch to your curl command:

curl -v -XGET 'localhost:9200/_analyze?analyzer=standard' -d '---------this is'

The response shows that its content type is application/yaml:

* Connected to localhost (::1) port 9200 (#0)
> GET /_analyze?analyzer=standard HTTP/1.1
> User-Agent: curl/7.37.1
> Host: localhost:9200
> Accept: */*
> Content-Length: 16
> Content-Type: application/x-www-form-urlencoded
> 
* upload completely sent off: 16 out of 16 bytes
< HTTP/1.1 200 OK
< Content-Type: application/yaml                   <---- HERE
< Content-Length: 183
< 
---
tokens:
- token: "this"
  start_offset: 9
  end_offset: 13
  type: "<ALPHANUM>"
  position: 1
- token: "is"
  start_offset: 14
  end_offset: 16
  type: "<ALPHANUM>"
  position: 2

Upvotes: 2

Pandiyan Cool

Reputation: 6565

When I tried the same in the Sense plugin, passing the text as a query parameter, I got JSON in both cases:

GET /_analyze?analyzer=standard&text=-this is

{
   "tokens": [
      {
         "token": "this",
         "start_offset": 1,
         "end_offset": 5,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "is",
         "start_offset": 6,
         "end_offset": 8,
         "type": "<ALPHANUM>",
         "position": 2
      }
   ]
}

GET /_analyze?analyzer=standard&text=---------this is

{
   "tokens": [
      {
         "token": "this",
         "start_offset": 9,
         "end_offset": 13,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "is",
         "start_offset": 14,
         "end_offset": 16,
         "type": "<ALPHANUM>",
         "position": 2
      }
   ]
}
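
Both requests above carry the text in the query string rather than in the request body, which is presumably why the response stays JSON. A minimal Python sketch of building such a URL, assuming a node at localhost:9200 as in the question (the helper name build_analyze_url is hypothetical):

```python
from urllib.parse import quote, urlencode


def build_analyze_url(text: str, analyzer: str = "standard",
                      base: str = "http://localhost:9200") -> str:
    """Build an _analyze URL that carries the text as a query parameter."""
    # quote_via=quote percent-encodes spaces as %20 instead of '+'.
    query = urlencode({"analyzer": analyzer, "text": text}, quote_via=quote)
    return f"{base}/_analyze?{query}"


print(build_analyze_url("---------this is"))
# http://localhost:9200/_analyze?analyzer=standard&text=---------this%20is
```

The resulting URL can be passed to curl without -d, so no request body is sent.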

Upvotes: 0
