Reputation:
How do I create a custom analyzer that tokenizes a field by '/' characters only?
I have URL strings in my field, for example: "https://stackoverflow.com/questions/ask". I want this tokenized like: "http", "stackoverflow.com", "questions" and "ask".
Upvotes: 0
Views: 25
Reputation: 8718
This seems to do what you want, using a pattern analyzer:
PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "analysis": {
         "analyzer": {
            "slash_analyzer": {
               "type": "pattern",
               "pattern": "[/:]+",
               "lowercase": true
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "url": {
               "type": "string",
               "index_analyzer": "slash_analyzer",
               "search_analyzer": "standard",
               "term_vector": "yes"
            }
         }
      }
   }
}
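If you want to sanity-check the analyzer before indexing anything, you can run some text through the _analyze API against the index. This is just a quick check in the same curl style as the other answer (not part of the original gist); it should return the same four tokens (http, stackoverflow.com, questions, ask):
curl -XGET 'localhost:9200/test_index/_analyze?analyzer=slash_analyzer&pretty' -d 'http://stackoverflow.com/questions/ask'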
PUT /test_index/doc/1
{
   "url": "http://stackoverflow.com/questions/ask"
}
I added term vectors in the mapping (you probably don't want to do this in production), so we can see what terms are generated:
GET /test_index/doc/1/_termvector
...
{
   "_index": "test_index",
   "_type": "doc",
   "_id": "1",
   "_version": 1,
   "found": true,
   "took": 1,
   "term_vectors": {
      "url": {
         "field_statistics": {
            "sum_doc_freq": 4,
            "doc_count": 1,
            "sum_ttf": 4
         },
         "terms": {
            "ask": {
               "term_freq": 1
            },
            "http": {
               "term_freq": 1
            },
            "questions": {
               "term_freq": 1
            },
            "stackoverflow.com": {
               "term_freq": 1
            }
         }
      }
   }
}
Here's the code I used:
http://sense.qbox.io/gist/669fbdd681895d7e9f8db13799865c6e8be75b11
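Because the field is indexed with slash_analyzer but searched with the standard analyzer, an ordinary match query on a single path segment should find the document. A minimal sketch (this query is not part of the original gist; index and field names are the ones used above):
POST /test_index/_search
{
   "query": {
      "match": {
         "url": "questions"
      }
   }
}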
Upvotes: 1
Reputation: 217274
The standard analyzer already does that for you.
curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'http://stackoverflow.com/questions/ask'
You get this:
{
   "tokens" : [ {
      "token" : "http",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 1
   }, {
      "token" : "stackoverflow.com",
      "start_offset" : 7,
      "end_offset" : 24,
      "type" : "<ALPHANUM>",
      "position" : 2
   }, {
      "token" : "questions",
      "start_offset" : 25,
      "end_offset" : 34,
      "type" : "<ALPHANUM>",
      "position" : 3
   }, {
      "token" : "ask",
      "start_offset" : 35,
      "end_offset" : 38,
      "type" : "<ALPHANUM>",
      "position" : 4
   } ]
}
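Since standard is the default analyzer, a plain string field gives you this behavior with no analysis settings at all. A minimal mapping sketch (index and field names are placeholders, using the same 1.x-era syntax as the other answer):
PUT /test_index
{
   "mappings": {
      "doc": {
         "properties": {
            "url": { "type": "string" }
         }
      }
   }
}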
Upvotes: 0