Reputation: 4713
I have two documents whose title fields are 'News ? E/CIS' and 'New Website ? E/CIS' (see the _source values in the output below).
If I search for the term new website, the score for the News document is much higher than the score for the other one, which is obviously not what I want. I wrapped an explain around the query and got:
'hits': [{'_explanation': {'desc': 'product of:',
'det': [{'desc': 'sum of:',
'det': [{'desc': 'product of:',
'det': [{'desc': 'sum of:',
'det': [{'desc': 'weight(title:new in 0) [PerFieldSimilarity], result of:',
'det': [{'desc': 'score(doc=0,freq=1.0), product of:',
'det': [{'desc': 'queryWeight, product of:',
'det': [{'desc': 'idf(docFreq=1, maxDocs=6)',
'value': 2.0986123},
{'desc': 'queryNorm',
'value': 0.14544667}],
'value': 0.3052362},
{'desc': 'fieldWeight in 0, product of:',
'det': [{'desc': 'tf(freq=1.0), with freq of:',
'det': [{'desc': 'termFreq=1.0',
'value': 1.0}],
'value': 1.0},
{'desc': 'idf(docFreq=1, maxDocs=6)',
'value': 2.0986123},
{'desc': 'fieldNorm(doc=0)',
'value': 0.625}],
'value': 1.3116326}],
'value': 0.40035775}],
'value': 0.40035775}],
'value': 0.40035775},
{'desc': 'coord(1/2)',
'value': 0.5}],
'value': 0.20017888}],
'value': 0.20017888},
{'desc': 'coord(1/2)',
'value': 0.5}],
'value': 0.10008944},
'_id': '2ff1307b536102e41e7daaccaf7edc69b16a348c',
'_index': 'scrapy',
'_node': 'D9SgrDb5RnO4NMAJMHiAOA',
'_score': 0.100089446,
'_shard': 3,
'_source': {'title': ['\n News ? E/CIS\n '],
'url': 'http://178.4.12.128:8888/news/'},
'_type': 'pages'},
{'_explanation': {'desc': 'product of:',
'det': [{'desc': 'sum of:',
'det': [{'desc': 'sum of:',
'det': [{'desc': 'weight(title:new in 0) [PerFieldSimilarity], result of:',
'det': [{'desc': 'score(doc=0,freq=1.0), product of:',
'det': [{'desc': 'queryWeight, product of:',
'det': [{'desc': 'idf(docFreq=1, maxDocs=1)',
'value': 0.30685282},
{'desc': 'queryNorm',
'value': 0.46183997}],
'value': 0.1417169},
{'desc': 'fieldWeight in 0, product of:',
'det': [{'desc': 'tf(freq=1.0), with freq of:',
'det': [{'desc': 'termFreq=1.0',
'value': 1.0}],
'value': 1.0},
{'desc': 'idf(docFreq=1, maxDocs=1)',
'value': 0.30685282},
{'desc': 'fieldNorm(doc=0)',
'value': 0.5}],
'value': 0.15342641}],
'value': 0.021743115}],
'value': 0.021743115},
{'desc': 'weight(title:websit in 0) [PerFieldSimilarity], result of:',
'det': [{'desc': 'score(doc=0,freq=1.0), product of:',
'det': [{'desc': 'queryWeight, product of:',
'det': [{'desc': 'idf(docFreq=1, maxDocs=1)',
'value': 0.30685282},
{'desc': 'queryNorm',
'value': 0.46183997}],
'value': 0.1417169},
{'desc': 'fieldWeight in 0, product of:',
'det': [{'desc': 'tf(freq=1.0), with freq of:',
'det': [{'desc': 'termFreq=1.0',
'value': 1.0}],
'value': 1.0},
{'desc': 'idf(docFreq=1, maxDocs=1)',
'value': 0.30685282},
{'desc': 'fieldNorm(doc=0)',
'value': 0.5}],
'value': 0.15342641}],
'value': 0.021743115}],
'value': 0.021743115}],
'value': 0.04348623}],
'value': 0.04348623},
{'desc': 'coord(1/2)',
'value': 0.5}],
'value': 0.021743115},
'_id': '265988d175a2b4a2ae2e462509089d5f701ed372',
'_index': 'scrapy',
'_node': 'D9SgrDb5RnO4NMAJMHiAOA',
'_score': 0.021743115,
'_shard': 0,
'_source': {'title': ['\n New Website ? E/CIS\n '],
'url': 'http://178.4.12.128:8888/news/2015-new-website/'},
'_type': 'pages'}],
'max_score': 0.100089446,
'total': 2}
Note: I shortened details to det and description to desc to save space.
It looks like the biggest difference comes from the different maxDocs values used in the scoring (maxDocs=6 for the first hit, maxDocs=1 for the second). Why do they differ? I thought maxDocs was the number of documents in the index. Shouldn't it be the same for both hits?
Full details follow, though they might not be needed:
My query:
'multi_match': {
'query': 'new website',
'type': 'most_fields',
'fields': ['title.raw^15', 'title^10'],
'analyzer': 'whitespace_analyzer',
}
My mapping:
'title': {
'type': 'string',
'store': 'yes',
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer",
'fields': {
'raw': {
'type': 'string',
'store': 'yes',
"search_analyzer": "whitespace_analyzer",
"index": "not_analyzed",
},
}
},
My analysis settings:
'analysis': {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
},
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
},
"analyzer": {
"html_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"char_filter": ["html_strip"],
"filter": [
'english_possessive_stemmer',
"lowercase",
'english_stop',
'english_stemmer',
"asciifolding",
]
},
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"char_filter": ["html_strip"], # Strips the html tags
"filter": [
'english_possessive_stemmer',
"lowercase",
'english_stop',
'english_stemmer',
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
'english_possessive_stemmer',
"lowercase",
'english_stop',
'english_stemmer',
"asciifolding",
]
}
Upvotes: 4
Views: 411
Reputation: 17461
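The default search type is query_then_fetch. With both query_then_fetch and query_and_fetch, term and document frequencies are calculated locally on each shard of the index. That is why maxDocs differs in your explain output: the two hits come from different shards (_shard 3 and _shard 0), and each shard scores with its own document count only.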
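To make the effect of per-shard statistics concrete, here is a minimal sketch that reproduces the two idf values from the explain output. It assumes the classic Lucene TF/IDF formula, idf = 1 + ln(maxDocs / (docFreq + 1)), which matches the numbers shown in the question. The function name classic_idf is only illustrative.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Classic Lucene TF/IDF idf (assumed): 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# Values taken from the explain output in the question:
# the "News" hit lives on shard 3 (maxDocs=6),
# the "New Website" hit lives on shard 0 (maxDocs=1).
idf_shard3 = classic_idf(doc_freq=1, max_docs=6)
idf_shard0 = classic_idf(doc_freq=1, max_docs=1)

print(round(idf_shard3, 7))  # 2.0986123
print(round(idf_shard0, 7))  # 0.3068528
```

The same term with the same docFreq gets roughly seven times the idf on the shard that holds more documents, which is enough to push the News document ahead even though it matches only one of the two query terms.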
However, if you want a more accurate calculation of term and document frequencies, you can use dfs_query_then_fetch or dfs_query_and_fetch. With these search types, the frequencies are calculated across all shards of the indices involved.
This article gives a more detailed explanation
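As a rough sketch of how the query from the question could be sent with dfs_query_then_fetch, the snippet below only builds the search arguments. The helper name build_search_kwargs is hypothetical, and the commented-out call assumes the official Python client (elasticsearch-py) in a version that accepts index, body and search_type as keyword arguments, plus a reachable node.

```python
def build_search_kwargs(query_text: str) -> dict:
    """Build search arguments that ask for global (cross-shard) term statistics."""
    return {
        "index": "scrapy",  # index name taken from the explain output
        "search_type": "dfs_query_then_fetch",
        "body": {
            "explain": True,  # keep the scoring breakdown for comparison
            "query": {
                "multi_match": {
                    "query": query_text,
                    "type": "most_fields",
                    "fields": ["title.raw^15", "title^10"],
                    "analyzer": "whitespace_analyzer",
                }
            },
        },
    }


kwargs = build_search_kwargs("new website")

# Assumed usage with the official client (not executed here):
#   from elasticsearch import Elasticsearch
#   results = Elasticsearch().search(**kwargs)
print(kwargs["search_type"])
```

With this search type, every hit should be scored against the same index-wide maxDocs and docFreq values. The trade-off is an extra round trip to the shards before the query phase.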
Upvotes: 5