Reputation: 696
Solr Version: 6.3.0
Cloud: Yes
Shards: Single(1)
Data Size: 50GB
Records: 12M
We have a Solr join query that finds the related ids within the same collection (yes, a self join). It is causing a performance hit.
On analysis, we found that Solr scans all the terms of from_field, irrespective of the q filter specified, and then tries to intersect them with the to_field terms. Is there a way to ask Solr to filter the terms before intersecting them with to_field in the join parser?
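To make the behaviour described above concrete, here is a toy Python model. It is not Solr's actual code: the dictionaries and both function names are hypothetical stand-ins for the term index of from_field and to_field. It only contrasts visiting every from_field term with visiting just the documents that matched the query.

```python
from typing import Dict, Set

Postings = Dict[str, Set[int]]  # term -> ids of the documents containing it


def join_by_term_scan(from_postings: Postings, from_set: Set[int],
                      to_postings: Postings) -> Set[int]:
    """Visit every from_field term and keep the ones that hit from_set."""
    result: Set[int] = set()
    for term, docs in from_postings.items():  # work grows with the term count
        if docs & from_set:
            result |= to_postings.get(term, set())
    return result


def join_by_matched_docs(doc_terms: Dict[int, Set[str]], from_set: Set[int],
                         to_postings: Postings) -> Set[int]:
    """Visit only the matched documents and look up their terms."""
    result: Set[int] = set()
    for doc in from_set:  # work grows with the number of matched documents
        for term in doc_terms.get(doc, set()):
            result |= to_postings.get(term, set())
    return result
```

With a single matching document, the first function still loops over every term, while the second touches only that document's terms.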
We have around 9M terms for the given Solr field, which we assume to be the cause of the performance hit.
"join": {
  "{!join from=from_field to=to_field fromIndex=insight_pats_1_shard1_replica1}to_field: \u0001\u0000\u0000\u0000\u0000\u0000\u0003X\u0002H": {
    "time": 16824,
    "fromSetSize": 1,
    "toSetSize": 0,
    "fromTermCount": 8561723,
    "fromTermTotalDf": 8561723,
    "fromTermDirectCount": 8561505,
    "fromTermHits": 0,
    "fromTermHitsTotalDf": 0,
    "toTermHits": 0,
    "toTermHitsTotalDf": 0,
    "toTermDirectCount": 0,
    "smallSetsDeferred": 0,
    "toSetDocsAdded": 0
  }
},
"rawquerystring": "*:*",
"querystring": "*:*",
"parsedquery": "(+MatchAllDocsQuery(*:*))/no_coord",
"parsedquery_toString": "+*:*",
"explain": { },
"QParser": "ExtendedDismaxQParser",
"altquerystring": null,
"boost_queries": null,
"parsed_boost_queries": [ ],
"boostfuncs": null,
"filter_queries": [
  "account_ids:1",
  "{!join from=from_field to=to_field fromIndex=insight_pats_1}to_field:7733576"
],
"parsed_filter_queries": [
  "account_ids:1",
  "JoinQuery({!join from=from_field to=to_field fromIndex=insight_pats_1_shard1_replica1}to_field: \u0001\u0000\u0000\u0000\u0000\u0000\u0003X\u0002H)"
]
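The join statistics in the debug output above can be summarized with a few lines of Python. This is a minimal sketch: the function name `summarize_join_debug` is hypothetical, and it assumes you pass in the parsed JSON object that contains the "join" key, with the same field names as shown above.

```python
from typing import Any, Dict


def summarize_join_debug(debug: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Per join clause, report the time and how many scanned terms were hits."""
    summary: Dict[str, Dict[str, Any]] = {}
    for clause, stats in debug.get("join", {}).items():
        scanned = stats.get("fromTermCount", 0)
        hits = stats.get("fromTermHits", 0)
        summary[clause] = {
            "time": stats.get("time"),  # copied as reported by Solr
            "terms_scanned": scanned,
            "terms_hit": hits,
            # Share of scanned terms that matched the from-side document set.
            "hit_ratio": hits / scanned if scanned else 0.0,
        }
    return summary
```

For the output above, this would show 8561723 terms scanned and 0 hits for a from set of size 1, which is the imbalance the question is about.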
Upvotes: 1
Views: 1008
There are two types of join parsers available.
By default, !join uses JoinQParserPlugin, which is not optimal when joining millions of records.
We can ask Solr to use ScoreJoinQParserPlugin instead by adding the parameter score=none to the !join parser command, as shown below.
http://localhost:8983/solr/mycollection/select?fq={!join from=from_field to=to_field fromIndex=from_collection score=none}&indent=on&q=*:*&wt=json&debugQuery=on
We were able to achieve a 30x improvement in performance in a case where from_field has around 8 million terms.
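When sending such a request from code, the local-params block needs URL encoding. The following is a minimal Python sketch under stated assumptions: the helper name `build_score_join_url` is hypothetical, the host and collection names are the placeholders from the URL above, and the example query `to_field:7733576` is taken from the filter query in the question.

```python
from urllib.parse import urlencode


def build_score_join_url(base_url: str, from_field: str, to_field: str,
                         from_index: str, query: str) -> str:
    """Build a select URL whose fq is a !join with score=none."""
    local_params = (f"{{!join from={from_field} to={to_field} "
                    f"fromIndex={from_index} score=none}}")
    params = [
        ("q", "*:*"),
        ("fq", local_params + query),  # join clause followed by the from-side query
        ("wt", "json"),
        ("indent", "on"),
        ("debugQuery", "on"),
    ]
    return base_url + "?" + urlencode(params)


url = build_score_join_url(
    "http://localhost:8983/solr/mycollection/select",
    "from_field", "to_field", "from_collection", "to_field:7733576",
)
print(url)
```

Keeping debugQuery=on makes it possible to compare timings with and without score=none on your own data.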
Upvotes: 1