Reputation: 1526
I am using Doctrine and MongoDB for an application, and there is one task that should import data from a CSV file into a collection. There are about 5 different CSV files with at least 450,000 entries per file that should be imported 1 to 3 times per year.
Currently I iterate through all lines of a file, create an object, call persist() and flush every 2,500 items.
Each item is not very large: it has an ID, a string of 10-20 characters, a string of 5-10 characters and a boolean value.
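Simplified, the import loop looks roughly like this (the document class, its setters, the column order and the file path are just placeholders):

```php
<?php
// $dm is the Doctrine MongoDB ODM DocumentManager.
$batchSize = 2500;
$i = 0;

if (($handle = fopen('/path/to/import.csv', 'r')) !== false) {
    while (($row = fgetcsv($handle)) !== false) {
        $item = new ImportItem();           // placeholder document class
        $item->setCode($row[1]);            // 10-20 character string
        $item->setShortCode($row[2]);       // 5-10 character string
        $item->setActive((bool) $row[3]);   // boolean value

        $dm->persist($item);

        if (++$i % $batchSize === 0) {
            $dm->flush();
        }
    }
    $dm->flush(); // flush the remaining documents
    fclose($handle);
}
```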
My first question: when I flush every 5,000 items, inserting gets significantly slower. In my test environment, flushing 5,000 items took about 102 seconds, while flushing 2,500 items took about 10 seconds.
Flushing also gets slower after a while. As mentioned, at the beginning flushing 2,500 items took about 10 seconds; after 100,000 items, flushing 2,500 items takes nearly one minute.
Are there any optimisations I can do to speed things up?
Upvotes: 2
Views: 2014
Reputation: 47585
I think there are two parts that may be optimized in your script.
The first is the way you read the file: you can either load it entirely into memory (with file_get_contents() or file() for example), or read it chunk by chunk with fopen() and fread(). The last option is preferred because it will only take the necessary amount of memory when processing a bunch of lines.
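As a rough sketch of the difference (the file path is only a placeholder):

```php
<?php
// Loads the whole file into memory at once - with 450,000 lines this can
// consume a lot of memory before any processing even starts.
$lines = file('/path/to/import.csv');

// Reads the file one line at a time - only the current line is held in memory.
$handle = fopen('/path/to/import.csv', 'r');
while (($line = fgets($handle)) !== false) {
    // process $line here (e.g. explode(',', $line), or use fgetcsv() directly)
}
fclose($handle);
```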
The second is that you need to clear() the objects you have already processed, otherwise they will be kept in memory until the end of your script. This means that if one DC2 object uses 1 MB of memory and you have 100,000 objects, at the end of your script it will use 100,000 MB.
So batching your inserts in ranges of 2,500 is quite a good idea, but you obviously need to remove the processed objects from the EntityManager. This can be done using $entityManager->clear();
The clear() call will clear the whole EntityManager; if you only want to clear a single entity, you may use $entityManager->detach($object).
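A minimal sketch of that batching pattern could look like this (using the same $entityManager variable as above, which with the MongoDB ODM is actually a DocumentManager, and a placeholder document class):

```php
<?php
$batchSize = 2500;
$i = 0;

$handle = fopen('/path/to/import.csv', 'r');
while (($row = fgetcsv($handle)) !== false) {
    $item = new ImportItem();   // placeholder document class
    // ... hydrate $item from $row ...
    $entityManager->persist($item);

    if (++$i % $batchSize === 0) {
        $entityManager->flush();
        // Detach all managed objects so they can be garbage collected
        // instead of accumulating until the end of the script.
        $entityManager->clear();
    }
}
$entityManager->flush();
$entityManager->clear();
fclose($handle);
```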
If you want to profile your memory usage, you may also be interested in the functions memory_get_usage() and memory_get_peak_usage().
Upvotes: 3
Reputation: 2964
Is there any reason this import script needs to use the Doctrine ODM? For the application as a whole - sure. But handling big amounts of data with so much overhead is not optimal, as PHP is not very efficient at holding and handling big collections.
The best optimization you could do is to skip that layer, use the MongoDB classes directly, and make sure you read the file line by line rather than all at once. Also, using batchInsert() will speed you up a lot if you have some network latency to MongoDB (but you never need to push it to 5k documents).
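A rough sketch of what that could look like with the legacy PHP Mongo driver (the database, collection, field names and CSV layout are only assumptions):

```php
<?php
$mongo      = new MongoClient();
$collection = $mongo->mydb->items;   // placeholder database/collection names

$batch     = [];
$batchSize = 1000;

$handle = fopen('/path/to/import.csv', 'r');
while (($row = fgetcsv($handle)) !== false) {
    // Build plain arrays instead of ODM documents - no hydration overhead.
    $batch[] = [
        'code'      => $row[1],
        'shortCode' => $row[2],
        'active'    => (bool) $row[3],
    ];

    if (count($batch) >= $batchSize) {
        $collection->batchInsert($batch);
        $batch = [];
    }
}
if ($batch) {
    $collection->batchInsert($batch); // insert the remaining documents
}
fclose($handle);
```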
Upvotes: 1