Martin Braun

Reputation: 602

Solr Performance for many documents query

I want Solr to always retrieve all documents found by a search (I know Solr wasn't built for that, but anyway), and I am currently doing it with this code:

    ...
    QueryResponse response = solr.query(query);
    long offset = 0;
    long totalResults = response.getResults().getNumFound();
    List<Article> ret = new ArrayList<Article>((int) totalResults);
    query.setRows(FETCH_SIZE);
    while (offset < totalResults) {
        //requires an int? wtf?
        query.setStart((int) offset);
        int left = (int) (totalResults - offset);
        if (left < FETCH_SIZE) {
            query.setRows(left);
        }
        response = solr.query(query);
        List<Article> current = response.getBeans(Article.class);
        offset += current.size();
        ret.addAll(current);
    }
    ...

This works, but it is pretty slow once a query gets over 1000 hits (I've read about that on here: it is caused by setting the start offset every time, which, for some reason, takes time). What would be a nicer (and faster) way to do this?

Upvotes: 1

Views: 4471

Answers (3)

Ramesh

Reputation: 210

Use the logic below to fetch Solr data in batches and keep the fetch query fast:

public List<Map<String, Object>> getData(int id, Set<String> fields) throws SolrServerException {
    final int SOLR_QUERY_MAX_ROWS = 3;
    long start = System.currentTimeMillis();
    SolrQuery query = new SolrQuery();
    String queryStr = "id:" + id;
    LOG.info(queryStr);
    query.setQuery(queryStr);
    query.setRows(SOLR_QUERY_MAX_ROWS);
    QueryResponse rsp = server.query(query, SolrRequest.METHOD.POST);
    List<Map<String, Object>> mapList = null;
    if (rsp != null) {
        long total = rsp.getResults().getNumFound();
        System.out.println("Total count found: " + total);
        // Solr query batch: page through the full result set,
        // SOLR_QUERY_MAX_ROWS documents at a time
        mapList = new ArrayList<Map<String, Object>>();
        addAllData(mapList, rsp, fields);
        int marker = SOLR_QUERY_MAX_ROWS;
        while (marker < total) {
            query.setStart(marker);
            rsp = server.query(query, SolrRequest.METHOD.POST);
            addAllData(mapList, rsp, fields);
            marker += SOLR_QUERY_MAX_ROWS;
        }
    }

    long end = System.currentTimeMillis();
    LOG.debug("SOLR Performance: getData: " + (end - start));

    return mapList;
}

private void addAllData(List<Map<String, Object>> mapList, QueryResponse rsp, Set<String> fields) {
    // copy only the requested fields of each document into a plain map
    for (SolrDocument sdoc : rsp.getResults()) {
        Map<String, Object> map = new HashMap<String, Object>();
        for (String field : fields) {
            map.put(field, sdoc.getFieldValue(field));
        }
        mapList.add(map);
    }
}
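
A hypothetical call, just to show the shape of the API (the id value and field names here are made up):

// hypothetical usage: the id and field names are placeholders
Set<String> fields = new HashSet<String>(Arrays.asList("id", "title"));
List<Map<String, Object>> rows = getData(42, fields);
System.out.println("fetched " + rows.size() + " rows");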

Upvotes: 0

cheffe

Reputation: 9500

To improve on the suggested answer you could use a streamed response. This was added especially for the case where one fetches all results. As you can see in Solr's Jira, the reporter there wanted to do the same thing as you. This has been implemented for Solr 4.

This is also described in Solrj's javadoc.

Without streaming, Solr packs the response into one whole XML/JSON document before it starts sending it, and your client then has to unpack all of that and offer it to you as a list. By using streaming, and the parallel processing that such a queued approach allows, performance should improve further.

Yes, you will lose the automatic bean mapping, but since performance is a factor here, I think this is acceptable.
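
If you do want beans, mapping by hand is straightforward; a minimal sketch (the Article setters here are assumptions, not a known API):

// hypothetical manual mapping from a streamed SolrDocument to an Article bean
Article tmpArticle = new Article();
tmpArticle.setId((String) tmpDoc.getFieldValue("id"));
tmpArticle.setTitle((String) tmpDoc.getFieldValue("title"));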

Here is a sample unit test:

import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.StreamingResponseCallback;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.junit.Test;

public class StreamingTest {

  @Test
  public void streaming() throws SolrServerException, IOException, InterruptedException {
    HttpSolrServer server = new HttpSolrServer("http://your-server");
    SolrQuery tmpQuery = new SolrQuery("your query");
    tmpQuery.setRows(Integer.MAX_VALUE);
    final BlockingQueue<SolrDocument> tmpQueue = new LinkedBlockingQueue<SolrDocument>();
    server.queryAndStreamResponse(tmpQuery, new MyCallbackHandler(tmpQueue));
    SolrDocument tmpDoc;
    do {
      tmpDoc = tmpQueue.take();
    } while (!(tmpDoc instanceof PoisonDoc));
  }

  private class PoisonDoc extends SolrDocument {
    // marker to finish queuing
  }

  private class MyCallbackHandler extends StreamingResponseCallback {
    private BlockingQueue<SolrDocument> queue;
    private long currentPosition;
    private long numFound;

    public MyCallbackHandler(BlockingQueue<SolrDocument> aQueue) {
      queue = aQueue;
    }

    @Override
    public void streamDocListInfo(long aNumFound, long aStart, Float aMaxScore) {
      // called before the documents start streaming
      // probably useful for some statistics
      currentPosition = aStart;
      numFound = aNumFound;
      if (numFound == 0) {
        queue.add(new PoisonDoc());
      }
    }

    @Override
    public void streamSolrDocument(SolrDocument aDoc) {
      currentPosition++;
      System.out.println("adding doc " + currentPosition + " of " + numFound);
      queue.add(aDoc);
      if (currentPosition == numFound) {
        queue.add(new PoisonDoc());
      }
    }
  }
}
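
To illustrate the parallel processing mentioned above, here is a minimal sketch of a consumer thread that drains the queue while documents are still streaming in. The executor wiring is hypothetical: it would replace the take() loop in the test, and the test method would additionally need to declare throws ExecutionException:

// hypothetical parallel consumer, using java.util.concurrent
ExecutorService tmpExecutor = Executors.newSingleThreadExecutor();
Future<Integer> tmpConsumed = tmpExecutor.submit(new Callable<Integer>() {
  @Override
  public Integer call() throws InterruptedException {
    int tmpCount = 0;
    SolrDocument tmpDoc;
    // process documents until the PoisonDoc marker arrives
    while (!((tmpDoc = tmpQueue.take()) instanceof PoisonDoc)) {
      // do the actual per-document work here
      tmpCount++;
    }
    return tmpCount;
  }
});
server.queryAndStreamResponse(tmpQuery, new MyCallbackHandler(tmpQueue));
System.out.println("processed " + tmpConsumed.get() + " documents");
tmpExecutor.shutdown();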

Upvotes: 8

femtoRgon
femtoRgon

Reputation: 33341

You might improve performance by increasing FETCH_SIZE. Since you are getting all the results anyway, pagination doesn't make sense unless you are concerned about memory or some such. And if 1000 results are liable to cause a memory overflow, I'd say your current performance seems pretty outstanding as it is.

So I would try getting everything at once, simplifying this to something like:

//WHOLE_BUNCHES is a constant representing a reasonable max number of docs we want to pull here.
//Integer.MAX_VALUE would probably invite an OutOfMemoryError, but that would be true of the
//implementation in the question anyway, since they were still being stored in the list at the end.
query.setRows(WHOLE_BUNCHES);
QueryResponse response = solr.query(query);
int totalResults = (int) response.getResults().getNumFound(); //If you even still need this figure.
List<Article> ret = response.getBeans(Article.class);

If you need to keep the pagination though:

You are performing this first query:

QueryResponse response = solr.query(query);

and are populating the number of found results from it, but you are not pulling any documents with that response. Even if you keep pagination, you could at least eliminate that one extra query.
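
Concretely (a minimal sketch reusing the names from the question):

query.setRows(FETCH_SIZE);
QueryResponse response = solr.query(query);      // no longer a throwaway query
long totalResults = response.getResults().getNumFound();
List<Article> ret = new ArrayList<Article>((int) totalResults);
ret.addAll(response.getBeans(Article.class));    // use the first page right away
while (ret.size() < totalResults) {
    query.setStart(ret.size());
    response = solr.query(query);
    ret.addAll(response.getBeans(Article.class));
}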

This:

int left = totalResults - offset;
if(left < FETCH_SIZE) {
    query.setRows(left);
}

is unnecessary. setRows specifies the maximum number of rows to return, so asking for more than are available won't cause any problems.
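
And if deep paging has to stay, note that Solr 4.7+ offers cursorMark, which avoids the growing cost of large start offsets entirely. A minimal sketch, assuming the collection's uniqueKey field is called id:

// cursorMark paging (Solr 4.7+); the uniqueKey field name "id" is an assumption
query.setRows(FETCH_SIZE);
query.setSort(SolrQuery.SortClause.asc("id"));   // cursorMark requires a sort on the uniqueKey
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
List<Article> ret = new ArrayList<Article>();
while (true) {
    query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
    QueryResponse response = solr.query(query);
    ret.addAll(response.getBeans(Article.class));
    String nextCursorMark = response.getNextCursorMark();
    if (cursorMark.equals(nextCursorMark)) {
        break;   // cursor did not advance: all results fetched
    }
    cursorMark = nextCursorMark;
}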

Finally, apropos of nothing, but I have to ask: what argument would you expect setStart to take if not an int?

Upvotes: 1
