Charles

Reputation: 375

Solr CollapsingQParserPlugin with group.facet=on style facet counts

I have a Solr 4.7.0 index of about 5 million documents (8 GB). I need result grouping, but it is too slow. Here is my grouping configuration:

group=on
group.facet=on
group.field=workId
group.ngroups=on

The machine has 24 GB of memory, 4 GB of which is allocated to Solr. Queries generally take about 1200 ms with grouping, compared to 90 ms with grouping turned off.

I came across the CollapsingQParserPlugin, which uses a filter query to remove all but one document from each group:

fq={!collapse field=workId}

It's designed for indexes with a large number of unique groups; I have about 3.8 million. This approach is much faster, at about 120 ms. It would be an ideal solution for me except for one thing: because it filters out the other members of each group, only the facet values of the representative document are counted. For instance, if I have the following three documents:

"docs": [
  {
    "id": "1",
    "workId": "abc",
    "type": "book"
  },
  {
    "id": "2",
    "workId": "abc",
    "type": "ebook"
  },
  {
    "id": "3",
    "workId": "abc",
    "type": "ebook"
  }
]

once collapsed, only the top one shows up in the results. Because the other two get filtered out, the facet counts look like

"type": ["book", 1]

instead of

"type": ["book", 1, "ebook", 1]
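To make the difference concrete, here is a small pure-Python simulation of the two counting strategies on the three documents above. It does not call Solr; the function names are made up for illustration, and the first document of each group simply stands in for whichever representative the collapse filter would pick.

```python
from collections import Counter

docs = [
    {"id": "1", "workId": "abc", "type": "book"},
    {"id": "2", "workId": "abc", "type": "ebook"},
    {"id": "3", "workId": "abc", "type": "ebook"},
]


def collapsed_facets(docs, group_field, facet_field):
    """Facet counts computed after keeping one document per group."""
    representatives = {}
    for doc in docs:
        # Keep only the first document seen for each group value.
        representatives.setdefault(doc[group_field], doc)
    return Counter(doc[facet_field] for doc in representatives.values())


def group_facets(docs, group_field, facet_field):
    """Facet counts where a group counts once per distinct facet value."""
    pairs = {(doc[facet_field], doc[group_field]) for doc in docs}
    return Counter(value for value, _ in pairs)


print(dict(collapsed_facets(docs, "workId", "type")))
print(dict(group_facets(docs, "workId", "type")))
```

The first call models what the collapse filter gives me today; the second models the group.facet=on behaviour I am trying to reproduce.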

Is there a way to get group.facet counts using the collapse filter query?

Upvotes: 4

Views: 1457

Answers (2)

Charles

Reputation: 375

According to Yonik Seeley, the correct group facet counts can be gathered using the JSON Facet API. His comments can be found at:

https://issues.apache.org/jira/browse/SOLR-7036?focusedCommentId=15601789&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15601789

I tested his method and it works well. I still use the CollapsingQParserPlugin to collapse the results, but I tag the collapse filter and exclude it when computing the facets, like so:

fq={!tag=workId}{!collapse field=workId}

json.facet={
  type: {
    type: terms,
    field: type,
    facet: {
      workCount: "unique(workId)"
    },
    domain: {
      excludeTags: [workId]
    }
  }
}

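And the result, where count is the number of matching documents and workCount is the number of distinct groups in each bucket: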

{  
  "facets": {  
    "count": 3,
    "type": {  
      "buckets": [  
        {  
          "val": "ebook",
          "count": 2,
          "workCount": 1
        },
        {  
          "val": "book",
          "count": 1,
          "workCount": 1
        }
      ]
    }
  }
}
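For anyone wiring this into client code, here is a minimal Python sketch that builds the request parameters shown above and pulls the per-group counts out of the response. The helper names (build_params, group_counts) are my own, not part of Solr or any client library, and sending the request is left to whatever HTTP client you already use.

```python
import json


def build_params(query, group_field, facet_field):
    """Build request parameters for a collapsed search with group facet counts."""
    tag = group_field  # the tag name is arbitrary; here it reuses the field name
    json_facet = {
        facet_field: {
            "type": "terms",
            "field": facet_field,
            # Count distinct groups per bucket instead of documents.
            "facet": {"workCount": f"unique({group_field})"},
            # Ignore the tagged collapse filter while faceting.
            "domain": {"excludeTags": [tag]},
        }
    }
    return {
        "q": query,
        "fq": f"{{!tag={tag}}}{{!collapse field={group_field}}}",
        "json.facet": json.dumps(json_facet),
    }


def group_counts(response, facet_name, stat="workCount"):
    """Map each facet value to its distinct-group count."""
    buckets = response["facets"][facet_name]["buckets"]
    return {bucket["val"]: bucket[stat] for bucket in buckets}
```

Calling build_params("*:*", "workId", "type") yields the same fq and json.facet as above, and group_counts applied to the parsed response gives one group count per facet value.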

Upvotes: 3

Charles

Reputation: 375

I was unable to find a way to do this with Solr or plugin configuration alone, so I developed a workaround that effectively produces group facet counts while still using the CollapsingQParserPlugin.

I do this by duplicating each field I facet on and making sure every document carries the facet values of its entire group, like so:

"docs": [
  {
    "id": "1",
    "workId": "abc",
    "type": "book",
    "facetType": [
      "book",
      "ebook"
    ]
  },
  {
    "id": "2",
    "workId": "abc",
    "type": "ebook",
    "facetType": [
      "book",
      "ebook"
    ]
  },
  {
    "id": "3",
    "workId": "abc",
    "type": "ebook",
    "facetType": [
      "book",
      "ebook"
    ]
  }
]
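The preprocessing step can be sketched in Python roughly as follows. This assumes all documents of a group are available in memory before indexing; the function name and its default field names are just for illustration.

```python
from collections import defaultdict


def add_group_facet_values(docs, group_field="workId",
                           source_field="type", target_field="facetType"):
    """Return copies of docs with the group's combined facet values attached."""
    values_by_group = defaultdict(set)
    for doc in docs:
        # Collect every distinct facet value seen within each group.
        values_by_group[doc[group_field]].add(doc[source_field])
    return [
        {**doc, target_field: sorted(values_by_group[doc[group_field]])}
        for doc in docs
    ]
```

Each returned document keeps its original type field and gains a multi-valued facetType field holding the values of the whole group.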

When I ask Solr to generate facet counts, I use the new field:

facet.field=facetType

This ensures that all facet values are accounted for and that the counts represent groups. But when I apply a filter query, I go back to the original field:

fq=type:book

This way the correct document is chosen to represent the group.

I know this is a dirty, complex way to make it work, but it does work, and that's what I needed. It also requires being able to look up the other documents in a group before inserting into Solr, which takes some development work. If anyone has a simpler solution, I would still love to hear it.

Upvotes: 2
