Florian Eckerstorfer

Reputation: 1526

Insert large amount of data into MongoDB collection with Doctrine

I am using Doctrine and MongoDB for an application, and there is one task that should import data from a CSV file into a collection. There are about 5 different CSV files with at least 450,000 entries each, and they need to be imported 1 to 3 times per year.

Currently I iterate through all lines of a file, create an object, call persist(), and flush every 2,500 items.

Each item is not very large: it has an ID, a string of 10-20 characters, a string of 5-10 characters, and a boolean value.
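Simplified, the loop looks roughly like this (the document class, its setters, and the variable names are only illustrative):

    <?php
    // Rough sketch of the current approach; ImportItem and its setters are made up.
    $batchSize = 2500;
    $i = 0;

    foreach ($lines as $line) {
        $item = new ImportItem();            // hypothetical document class
        $item->setCode($line[0]);            // 10-20 character string
        $item->setShortCode($line[1]);       // 5-10 character string
        $item->setActive((bool) $line[2]);   // boolean value

        $dm->persist($item);                 // $dm is the Doctrine ODM DocumentManager

        if (++$i % $batchSize === 0) {
            $dm->flush();
        }
    }

    $dm->flush();                            // flush the remaining documents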

My first observation is that the batch size matters a lot: when I flush every 5,000 items, inserting gets significantly slower. In my test environment, flushing 5,000 items took about 102 seconds, while flushing 2,500 items took about 10 seconds.

Flushing also gets slower over time. As mentioned, at the beginning flushing 2,500 items took about 10 seconds; after 100,000 items, flushing 2,500 items takes nearly a minute.

Are there any optimisations I can do to speed things up?

Upvotes: 2

Views: 2014

Answers (2)

Boris Guéry

Reputation: 47585

I think there are two parts that may be optimized in your script.

  • The way you read your CSV files: depending on how you load them, you either load them entirely into memory (using file_get_contents() or file(), for example) or read them chunk by chunk with fopen() and fread() (a combined sketch follows after this list).

The latter is preferred because it only uses the memory needed for the lines currently being processed.

  • You need to clear() the objects you have already processed, otherwise they are kept in memory until the end of your script. This means that if one DC2 object uses 1 MB of memory and you have 100,000 objects, the script will end up using 100,000 MB. So batching your inserts in groups of 2,500 is quite a good idea, but you obviously need to remove the processed objects from the EntityManager.

It can be done using $entityManager->clear();

clear() will clear the whole EntityManager; if you only want to detach a single entity, you can use $entityManager->detach($object).
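Putting both points together, a minimal sketch of the import loop might look something like this (the ImportItem class, its setters, and the file name are made up; $dm is the Doctrine manager, on which clear() behaves as described above):

    <?php
    $batchSize = 2500;
    $handle    = fopen('import.csv', 'r');   // stream the file instead of loading it all at once
    $i         = 0;

    while (($row = fgetcsv($handle)) !== false) {
        $item = new ImportItem();            // hypothetical document class
        $item->setCode($row[0]);
        $item->setShortCode($row[1]);
        $item->setActive((bool) $row[2]);

        $dm->persist($item);

        if (++$i % $batchSize === 0) {
            $dm->flush();
            $dm->clear();                    // detach processed objects so memory stays flat
        }
    }

    $dm->flush();                            // flush and clear the remaining objects
    $dm->clear();
    fclose($handle);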


If you want to profile your memory usage, you may also be interested in the functions

memory_get_usage()

memory_get_peak_usage()
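For example, a quick, purely illustrative way to watch memory while importing is to print it after each flush:

    // Print memory usage after each flushed batch to see where it starts to climb
    printf(
        "memory: %.1f MB (peak: %.1f MB)\n",
        memory_get_usage(true) / 1048576,
        memory_get_peak_usage(true) / 1048576
    );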

Upvotes: 3

Clarence

Reputation: 2964

Is there any reason this import script needs to use the Doctrine ODM? For the application as a whole, sure. But handling large amounts of data with that much overhead is not optimal, as PHP is not very efficient at holding and handling big collections.

The best optimization you can make is to skip that layer, use the MongoDB driver classes directly, and make sure you read the file line by line rather than all at once. Using batchInsert() will also speed things up a lot if you have some network latency to MongoDB (but you never need to push it to 5k documents).
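A minimal sketch of that approach, assuming the legacy PHP Mongo driver and made-up database, collection, and field names:

    <?php
    $mongo      = new MongoClient();
    $collection = $mongo->mydb->items;       // hypothetical database and collection

    $handle = fopen('import.csv', 'r');      // read line by line instead of all at once
    $batch  = array();

    while (($row = fgetcsv($handle)) !== false) {
        $batch[] = array(
            'code'      => $row[0],
            'shortCode' => $row[1],
            'active'    => (bool) $row[2],
        );

        if (count($batch) === 1000) {        // keep batches well below 5k documents
            $collection->batchInsert($batch);
            $batch = array();
        }
    }

    if (!empty($batch)) {
        $collection->batchInsert($batch);    // insert whatever is left
    }
    fclose($handle);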

Upvotes: 1
