Reputation: 84
I try to use Elastic Search
to find most similar tags from text.
For example, I create test_index and insert two documents:
POST test_index/_doc/17
{
"id": 17,
"tags": ["it", "devops", "server"]
}
POST test_index/_doc/20
{
"id": 20,
"tags": ["software", "hardware"]
}
So, i expect to find "software" tag (text or id) from "I'm using some softwares and applications" text.
I was hoping someone can provide an example on how to do this or at least point me in the right direction.
Thanks.
Upvotes: 0
Views: 218
Reputation: 84
If you search text that has base or root word, Stemming
is good way.
If you need to find most similar word(s) from text, Ngram
is more suitable way.
If you search exact words of text in word of tags, Shingles
is better way.
Upvotes: 0
Reputation: 8840
What you are looking for is nothing but a concept called as Stemming
. You would need to create a Custom Analyzer and make use of Stemmer Token Filter.
Please find the below mapping, sample documents, query and response:
PUT my_stem_index
{
"settings": {
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "standard",
"filter" : ["lowercase", "my_stemmer"]
}
},
"filter" : {
"my_stemmer" : {
"type" : "stemmer",
"name" : "english"
}
}
}
},
"mappings": {
"properties": {
"id":{
"type": "keyword"
},
"tags":{
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword":{
"type": "keyword"
}
}
}
}
}
}
From comments, it appears that you are using version < 7. For that you may have to add type
in it.
PUT my_stem_index
{
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"tokenizer":"standard",
"filter":[
"lowercase",
"my_stemmer"
]
}
},
"filter":{
"my_stemmer":{
"type":"stemmer",
"name":"english"
}
}
}
},
"mappings":{
"_doc":{
"properties":{
"id":{
"type":"keyword"
},
"tags":{
"type":"text",
"analyzer":"my_analyzer",
"fields":{
"keyword":{
"type":"keyword"
}
}
}
}
}
}
}
POST my_stem_index/_doc/17
{
"id": 17,
"tags": ["it", "devops", "server"]
}
POST my_stem_index/_doc/20
{
"id": 20,
"tags": ["software", "hardware"]
}
POST my_stem_index/_doc/21
{
"id": 21,
"tags": ["softwares and applications", "hardwares and storage devices"]
}
POST my_stem_index/_search
{
"query": {
"match": {
"tags": "software"
}
}
}
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.5908618,
"hits" : [
{
"_index" : "my_stem_index",
"_type" : "_doc",
"_id" : "20",
"_score" : 0.5908618,
"_source" : {
"id" : 20,
"tags" : [
"software",
"hardware"
]
}
},
{
"_index" : "my_stem_index",
"_type" : "_doc",
"_id" : "21",
"_score" : 0.35965496,
"_source" : {
"id" : 21,
"tags" : [
"softwares and applications", <--- Note this has how `softwares` also was searchable.
"hardwares and storage devices"
]
}
}
]
}
}
Notice in response as how both the documents i.e. having _id 20
and 21
appear.
If you are new to Elasticsearch, I'd suggest spending sometime to understand the concept of Analysis and how Elasticsearch implements the same using Analyzers
.
This would help you understand how the document with softwares and applications
is also returning when you only query for software
and or vice versa.
Hope this helps!
Upvotes: 2