Clay Wardell
Clay Wardell

Reputation: 15582

Retrieve analyzed tokens from ElasticSearch documents

Trying to access the analyzed/tokenized text in my ElasticSearch documents.

I know you can use the Analyze API to analyze arbitrary text according your analysis modules. So I could copy and paste data from my documents into the Analyze API to see how it was tokenized.

This seems unnecessarily time consuming, though. Is there any way to instruct ElasticSearch to returned the tokenized text in search results? I've looked through the docs and haven't found anything.

Upvotes: 44

Views: 29764

Answers (3)

Torsten Engelbrecht
Torsten Engelbrecht

Reputation: 13486

This question is a litte old, but maybe I think an additional answer is necessary.

With ElasticSearch 1.0.0 the Term Vector API was added which gives you direct access to the tokens ElasticSearch stores under the hood on per document basis. The API docs are not very clear on this (only mentioned in the example), but in order to use the API you have to first indicate in your mapping definition that you want to store term vectors with the term_vector property on each field.

Upvotes: 18

javanna
javanna

Reputation: 60195

Have a look at this other answer: elasticsearch - Return the tokens of a field. Unfortunately it requires to reanalyze on the fly the content of your field using the script provided.
It should be possible to write a plugin to expose this feature. The idea would be to add two endpoints to:

  • allow to read the lucene TermsEnum like the solr TermsComponent does, useful to make auto-suggestions too. Note that it wouldn't be per document, just every term on the index with term frequency and document frequency (potentially expensive with a lot of unique terms)
  • allow to read the term vectors if enabled, like the solr TermVectorComponent does. This would be per document but requires to store the term vectors (you can configure it in your mapping) and allows also to retrieve positions and offsets if enabled.

Upvotes: 16

tick_tack_techie
tick_tack_techie

Reputation: 429

You may want to use scripting, however your server should have the scripting enabled.

curl 'http://localhost:9200/your_index/your_type/_search?pretty=true' -d '{
    "query" : {
        "match_all" : { }
    },
    "script_fields": {
        "terms" : {
            "script": "doc[field].values",
            "params": {
                "field": "field_x.field_y"
            }
        }
    }
}'

The default setting for allowing the script depends on the elastic search version, so please check that out from the official documentation.

Upvotes: 5

Related Questions