Rodion Gorkovenko

Reputation: 2852

Iterate over large collection in MongoDB via spring-data

Friends!

I am using MongoDB in a Java project via spring-data. I use Repository interfaces to access data in collections. For some processing I need to iterate over all elements of a collection. I can use the fetchAll method of the repository, but it always returns an ArrayList.

However, one of the collections is expected to be large - up to 1 million records of at least several kilobytes each. I suppose I should not use fetchAll in such cases, but I could find neither convenient methods returning some iterator (which would allow the collection to be fetched partially), nor convenient methods with callbacks.

I've only seen support for retrieving such collections in pages. I wonder whether that is the only way of working with such collections?

Upvotes: 37

Views: 47509

Answers (10)

Jeryl Cook

Reputation: 1038

You can still use mongoTemplate to access the collection and simply use a DBCursor:

     DBCollection collection = mongoTemplate.getCollection("boundary");
     DBCursor cursor = collection.find();
     while (cursor.hasNext()) {
         DBObject obj = cursor.next();
         Object object = obj.get("polygons");
         // process the object here
     }

*Updated in 2023. getObjectMapper() is a Jackson ObjectMapper; you must register the Spring Data GeoJsonModule as a serializer/deserializer.

    MongoCollection<org.bson.Document> collection = mongoTemplate.getCollection("postals");
    FindIterable<Document> cursor = collection.find();
    cursor.forEach(obj -> {
        Document geoJsonPolygonDocument = (Document) obj.get("polygon");
        if (geoJsonPolygonDocument != null) {
            try {
                GeoJsonPolygon geoJsonPolygon = getObjectMapper().readValue(geoJsonPolygonDocument.toJson(), GeoJsonPolygon.class);
                // use geoJsonPolygon here
            } catch (JsonProcessingException e) {
                throw new RuntimeException(e); // readValue throws a checked exception inside the lambda
            }
        }
    });
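For reference, a minimal sketch of the kind of ObjectMapper setup meant above; the getObjectMapper() helper is my assumption of what it could return (GeoJsonModule is the Spring Data Jackson module mentioned):

    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.springframework.data.mongodb.core.geo.GeoJsonModule;

    // Hypothetical helper: a Jackson ObjectMapper with the Spring Data GeoJSON module registered,
    // so GeoJsonPolygon (and other GeoJson types) can be read back from toJson() output.
    private ObjectMapper getObjectMapper() {
        ObjectMapper mapper = new ObjectMapper();
        mapper.registerModule(new GeoJsonModule());
        return mapper;
    }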
    

Upvotes: 13

Florian F.

Reputation: 11

Due to MongoDB cursor timeouts, if you have a long-running process you can lose the cursor...

I recommend using paging:

        final int pageSize = 1000;
        var paging = Pageable.ofSize(pageSize);
        do {
            Page<T> page = repository.findAll(paging); // Retrieve page items
            page.forEach(item -> this.processItem(item)); // Do item job

            // page++
            paging = page.nextPageable(); // If last: returns Pageable.unpaged()
        }
        while (paging.isPaged()); // If last: Unpaged.isPaged() returns false

And for the repository, there are two options:

    // Option 1: use the Spring Data interface
    @Repository
    public interface YourDao extends PagingAndSortingRepository<T, ID> {
        // extending this interface makes Spring Data generate the implementation of:
        // Page<T> findAll(Pageable pageable);
    }

    // Option 2: create your own implementation
    public class YourDaoImpl implements YourDao {
        @Override
        public Page<T> findAll(Pageable pageable) {
            final var query = new Query().with(pageable);
            var items = mongoTemplate.find(query, T.class);
            return PageableExecutionUtils.getPage(
                  items, 
                  pageable, 
                  () -> mongoTemplate.count(Query.of(query).limit(-1).skip(-1), T.class));
        }
    }

Upvotes: 0

Maysam Mok

Reputation: 369

Since this question got bumped recently, this answer needs some more love!

If you use Spring Data Repository interfaces, you can declare a custom method that returns a Stream, and it will be implemented by Spring Data using cursors:

import java.util.stream.Stream;

public interface AlarmRepository extends CrudRepository<Alarm, String> {

    Stream<Alarm> findAllBy();

}

So for large amounts of data you can stream the results and process them one by one without memory limitations.

See https://docs.spring.io/spring-data/mongodb/docs/current/reference/html/#mongodb.repositories.queries
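A minimal usage sketch, assuming an injected alarmRepository and some per-document processing of your own; the stream is backed by a cursor, so it should be closed, e.g. via try-with-resources:

    // Closing the stream releases the underlying MongoDB cursor
    try (Stream<Alarm> alarms = alarmRepository.findAllBy()) {
        alarms.forEach(alarm -> {
            // process each document here
        });
    }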

Upvotes: 23

jspek

Reputation: 446

This answer is based on: https://stackoverflow.com/a/22711715/5622596

That answer needs a bit of an update as PageRequest has changed how it is being constructed.

With that said here is my modified response:

int pageNumber = 0; // PageRequest pages are zero-indexed

//Change value to whatever size you want the page to have
int pageLimit = 100;

Page<SomeClass> page;
List<SomeClass> compoundList = new LinkedList<>();

do {
    PageRequest pageRequest = PageRequest.of(pageNumber, pageLimit);

    page = repository.findAll(pageRequest);

    List<SomeClass> listFromPage = page.getContent();

    //Do something with this list, example below
    compoundList.addAll(listFromPage);

    pageNumber++;

} while (!page.isLast());

//Do something with the compoundList, example below
return compoundList;

Upvotes: 0

Abhilash Mishra

Reputation: 1

The best way to iterate over a large collection is to use the Mongo API directly. I used the code below and it worked like a charm for my use-case.
I had to iterate over more than 15M records, and the document size was huge for some of them.
The following code is from a Kotlin Spring Boot app (Spring Boot version: 2.4.5).

fun getAbcCursor(batchSize: Int, from: Long?, to: Long?): MongoCursor<Document> {

    val collection = xyzMongoTemplate.getCollection("abc")
    val query = Document("field1", "value1")
    if (from != null) {
        val fromDate = Date(from)
        val toDate = if (to != null) { Date(to) } else { Date() }
        query.append(
            "createTime",
            Document(
                "\$gte", fromDate
            ).append(
                "\$lte", toDate
            )
        )
    }
    return collection.find(query).batchSize(batchSize).iterator()
}

Then, from a service-layer method, you can just keep calling MongoCursor.next() on the returned cursor as long as MongoCursor.hasNext() returns true.

An important observation: do not forget to set batchSize on the 'FindIterable' (the return type of MongoCollection.find()). If you don't provide a batch size, the cursor will fetch the initial 101 records and then hang (it tries to fetch all the remaining records at once).
For my scenario, I used a batch size of 2000, as it gave the best results during testing. The optimal batch size depends on the average size of your records.

Here is the equivalent code in Java (with createTime removed from the query, as it is specific to my data model).

    MongoCursor<Document> getAbcCursor(int batchSize) {
        MongoCollection<Document> collection = xyzMongoTemplate.getCollection("your_collection_name");
        Document query = new Document("field1", "value1"); // query --> {"field1": "value1"}
        return collection.find(query).batchSize(batchSize).iterator();
    }
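For completeness, a minimal sketch of consuming that cursor from the caller; MongoCursor is Closeable, so try-with-resources releases the server-side cursor (processDocument() is just a hypothetical handler):

    try (MongoCursor<Document> cursor = getAbcCursor(2000)) {
        while (cursor.hasNext()) {
            Document doc = cursor.next();
            processDocument(doc); // hypothetical per-document handler
        }
    }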

Upvotes: 0

udalmik

Reputation: 7988

Late response, but maybe it will help someone in the future. Spring Data doesn't provide any API to wrap the MongoDB cursor capabilities. It uses the cursor within its find methods, but always returns a completed list of objects. The options are to use the Mongo API directly or to use the Spring Data Paging API, something like this:

        final int pageLimit = 300;
        int pageNumber = 0;
        Page<T> page = repository.findAll(new PageRequest(pageNumber, pageLimit));
        while (page.hasNextPage()) {
            processPageContent(page.getContent());
            page = repository.findAll(new PageRequest(++pageNumber, pageLimit));
        }
        // process last page
        processPageContent(page.getContent());

UPD (!): this method is not sufficient for large sets of data (see @Shawn Bush's comments). Please use the Mongo API directly for such cases.

Upvotes: 29

Clement.Xu

Reputation: 1308

You may want to try the DBCursor way like this:

    DBObject query = new BasicDBObject(); //setup the query criteria
    query.put("method", method);
    query.put("ctime", (new BasicDBObject("$gte", bTime)).append("$lt", eTime));

    logger.debug("query: {}", query);

    DBObject fields = new BasicDBObject(); //only get the needed fields.
    fields.put("_id", 0);
    fields.put("uId", 1);
    fields.put("ctime", 1);

    DBCursor dbCursor = mongoTemplate.getCollection("collectionName").find(query, fields);

    while (dbCursor.hasNext()){
        DBObject object = dbCursor.next();
        logger.debug("object: {}", object);
        //do something.
    }

Upvotes: 2

Segabond

Reputation: 1143

Use MongoTemplate::stream(), which is probably the most appropriate Java wrapper around DBCursor.
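A minimal sketch of what that can look like, assuming Spring Data MongoDB 3.x+ where stream() returns a java.util.stream.Stream backed by a cursor (older versions return a CloseableIterator instead); MyDocument and process() are placeholders:

    // Stream documents matching a query without loading them all into memory;
    // closing the stream releases the underlying cursor.
    try (Stream<MyDocument> docs = mongoTemplate.stream(new Query(), MyDocument.class)) {
        docs.forEach(doc -> process(doc));
    }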

Upvotes: 12

ramon_salla

Reputation: 1607

Another way:

int pageNumber = 0;
int pageLimit = 300;
Page<T> page;

do {
  page = repository.findAll(new PageRequest(pageNumber, pageLimit));
  // process page.getContent() here
  pageNumber++;

} while (!page.isLastPage());

Upvotes: 4
