Reputation: 552
I have a large file, almost 20 GB with more than 20 million lines, and each line represents a separate serialized JSON object.
Reading the file line by line in a regular loop and performing manipulations on each line's data takes a lot of time.
Is there any state-of-the-art approach or best practice for reading a large file in parallel, in smaller chunks, to make the processing faster?
I'm using Python 3.6.X
Upvotes: 5
Views: 6060
Reputation: 2853
The main bottleneck you need to be aware of here is neither disk nor CPU, it is memory: not how much you have, but how the hardware and the OS work together to pre-fetch pages from RAM into the L1, L2, and L3 caches.
The parallel approach will perform worse than the serial approach if you use readline() to load one line at a time. The reason has to do with hardware + OS, not software. When the CPU requests a memory read, a certain amount of extra data is fetched into the CPU's caches in anticipation that it will be needed soon. With the serial approach, that extra data is in fact used while it is still in the cache. But when parallel readline() calls are happening, the extra data is evicted before there is a chance to use it, so it has to be fetched again later. This has a huge impact on performance.
The way to make the parallel approach beat the serial readline() approach is to have your parallel processes read more than one line at a time into memory: use read() instead of readline(). How many bytes should you read? Approximately the size of your L1 cache, about 64 KB. That way several pages of contiguous memory can be loaded into the cache at once.
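Here is a minimal sketch of that idea (the file name data.jsonl, the handle() body, and the chunk size are placeholder assumptions, not part of the question). Each worker claims a byte range of the file, aligns itself to a line boundary, and then pulls data in with 64K read() calls instead of readline(). The ownership rule is that a worker processes every line that starts inside its range, so each line is handled exactly once:

    import json
    import multiprocessing as mp
    import os

    CHUNK = 64 * 1024    # roughly L1-cache sized, as discussed above
    PATH = "data.jsonl"  # placeholder name for your 20 GB file

    def handle(raw_line):
        record = json.loads(raw_line)  # your per-line manipulation goes here

    def worker(start, end):
        """Process every line that STARTS inside [start, end)."""
        with open(PATH, "rb") as f:
            if start == 0:
                f.seek(0)
            else:
                f.seek(start - 1)
                if f.read(1) != b"\n":
                    f.readline()  # partial line; the previous worker owns it
            line_start = f.tell()
            buf = b""
            while line_start < end:
                block = f.read(CHUNK)
                if not block:     # EOF
                    if buf:
                        handle(buf)  # trailing line without a newline
                    return
                buf += block
                lines = buf.split(b"\n")
                buf = lines.pop()    # possibly incomplete last line
                for line in lines:
                    if line_start >= end:
                        return       # the next worker owns this line
                    if line:         # skip blank lines
                        handle(line)
                    line_start += len(line) + 1

    if __name__ == "__main__":
        size = os.path.getsize(PATH)
        n = mp.cpu_count()
        step = size // n
        procs = [mp.Process(target=worker,
                            args=(i * step, size if i == n - 1 else (i + 1) * step))
                 for i in range(n)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()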
However, if you replace serial readline() with serial read(), the serial version will outperform parallel read(). Why? Although each core has its own L1 cache, the cores share the higher-level caches and the memory bus, so you run into the same problem: process 1 pre-fetches data to populate the caches, but before it has time to process it all, process 2 takes over and replaces the contents of the cache, forcing process 1 to fetch the same data again later.
The performance difference between serial and parallel is easy to demonstrate. First make sure the whole file is in the page cache (check the buff/cache column of free -m); this takes the disk/block-device layer out of the equation entirely, since the whole file is in RAM. Then run a simple serial version and time it, then a simple parallel version using multiprocessing.Process() and time it. On commodity systems, the serial reads will outperform the parallel reads.
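A sketch of such a timing experiment, under the same placeholder assumptions as above (warm the page cache first, e.g. with cat data.jsonl > /dev/null):

    import multiprocessing as mp
    import os
    import time

    CHUNK = 64 * 1024
    PATH = "data.jsonl"  # placeholder; make sure it is already in the page cache

    def read_range(start, end):
        """Read bytes [start, end) in 64K chunks and throw them away."""
        with open(PATH, "rb") as f:
            f.seek(start)
            remaining = end - start
            while remaining > 0:
                block = f.read(min(CHUNK, remaining))
                if not block:
                    break
                remaining -= len(block)

    if __name__ == "__main__":
        size = os.path.getsize(PATH)

        t0 = time.perf_counter()
        read_range(0, size)  # serial: one process reads the whole file
        print("serial  :", time.perf_counter() - t0)

        n = mp.cpu_count()
        step = size // n
        procs = [mp.Process(target=read_range,
                            args=(i * step, size if i == n - 1 else (i + 1) * step))
                 for i in range(n)]
        t0 = time.perf_counter()
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print("parallel:", time.perf_counter() - t0)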
This is contrary to what you expected, right? But think about the assumptions you were making: they didn't include the fact that caches and memory buses are shared between multiple cores, and that is precisely where the bottleneck sits. We know the disk isn't the bottleneck because the entire file is in the page cache in RAM. We know the CPU isn't the bottleneck because multiprocessing.Process() gives us truly simultaneous execution, one process per core (verify with vmstat 1, or with top -d 1 after pressing 1 to show per-core usage).
The only way the parallel approach can come out ahead is to run on hardware with separate NUMA nodes, where groups of cores have their own memory buses and their own caches.
As a side note, contrary to what other answers have claimed, Python can certainly do true computational multiprocessing, putting 100% utilization on every core at the same time. That is what multiprocessing.Process() gives you. Thread pools will not work for this because CPython threads are serialized by the GIL; multiprocessing.Process() is not restricted by the GIL and certainly does work.
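For example, this small sketch will put every core at ~100% simultaneously, which a thread pool cannot do for pure-Python work (the loop body is an arbitrary stand-in for real computation):

    import multiprocessing as mp

    def burn(n):
        """Pure-Python CPU work; each process has its own interpreter and GIL."""
        total = 0
        for i in range(n):
            total += i * i

    if __name__ == "__main__":
        procs = [mp.Process(target=burn, args=(50_000_000,))
                 for _ in range(mp.cpu_count())]
        for p in procs:
            p.start()  # watch every core hit ~100% in top (press 1)
        for p in procs:
            p.join()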
Another side note: it is bad practice to load your entire input file into the heap. You never know whether the file is larger than the available RAM, which would trigger the OOM killer. So don't try it as an attempt to optimize the performance of your code.
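A quick illustration of the difference (data.jsonl is a placeholder name):

    # Bad: pulls all ~20 GB into the heap at once and risks the OOM killer
    with open("data.jsonl") as f:
        everything = f.read()

    # Fine: the file object is a lazy iterator, memory stays bounded
    with open("data.jsonl") as f:
        for line in f:
            pass  # process one serialized JSON line here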
Upvotes: 0
Reputation: 10889
There are several possibilities, but first profile your code to find the bottlenecks. Maybe your processing does something slow that can be sped up, which would be vastly preferable to multiprocessing. If that does not help, you could try:
Use another file format. Parsing serialized JSON from text is not the fastest operation in the world, so you could store your data in a different format (for example HDF5), which could speed up processing.
Implement multiple worker processes, each reading a portion of the file (worker 1 reads lines 0 to 1 million, worker 2 reads lines 1 million to 2 million, and so on). You can orchestrate that with joblib or celery, depending on your needs. Integrating the results is the challenge; there you have to see what your needs are (map-reduce style?). Because of the GIL, Python threads cannot run bytecode in parallel, which makes this harder than in other languages, so maybe you could switch languages for this step. A rough sketch of the worker-per-range idea follows below.
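Here is that sketch using joblib, with each worker handed a byte range rather than line numbers so nobody has to pre-scan the file (the file name and the counting "reduce" step are placeholder assumptions):

    import json
    import os
    from joblib import Parallel, delayed  # pip install joblib

    PATH = "data.jsonl"  # placeholder: one serialized JSON object per line

    def process_range(start, end):
        """Map step: parse every line that starts inside [start, end)."""
        count = 0
        with open(PATH, "rb") as f:
            if start > 0:
                f.seek(start - 1)
                if f.read(1) != b"\n":
                    f.readline()  # partial line; the previous worker owns it
            while f.tell() < end:
                line = f.readline()
                if not line:
                    break
                record = json.loads(line)  # your per-line manipulation here
                count += 1
        return count

    if __name__ == "__main__":
        size = os.path.getsize(PATH)
        n_jobs = 4
        step = size // n_jobs
        ranges = [(i * step, size if i == n_jobs - 1 else (i + 1) * step)
                  for i in range(n_jobs)]
        partials = Parallel(n_jobs=n_jobs)(
            delayed(process_range)(s, e) for s, e in ranges)
        print(sum(partials))  # reduce step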
Upvotes: 1
Reputation: 1648
Unfortunately, no. Reading in files and operating on the lines read (such as JSON parsing or computation) is a CPU-bound operation, so there are no clever asyncio tactics to speed it up. In theory one could utilize multiprocessing and multiple cores to read and process in parallel, but having multiple threads reading the same file is bound to cause major problems. Because your file is so large, storing it all in memory and then parallelizing the computation is also going to be difficult.
Your best bet is to head this problem off at the pass by partitioning the data (if possible) into multiple files, which opens up safer doors to parallelism across multiple cores. Sorry there isn't a better answer, AFAIK.
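If you can split the file up front, that route might look roughly like this (the shard names, the split command, and the per-line work are assumptions for illustration):

    import glob
    import json
    import multiprocessing as mp

    # Split the big file beforehand, e.g. with the coreutils split tool:
    #   split -l 1000000 data.jsonl part_    # ~20 shards of 1M lines each

    def process_shard(path):
        """Each worker owns one shard; no two processes touch the same file."""
        count = 0
        with open(path) as f:
            for line in f:
                record = json.loads(line)  # your per-line computation here
                count += 1
        return count

    if __name__ == "__main__":
        shards = sorted(glob.glob("part_*"))
        with mp.Pool() as pool:
            partial_counts = pool.map(process_shard, shards)
        print(sum(partial_counts))  # combine the partial results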
Upvotes: 3