Reputation: 551
Recently I ran into some serious performance issues in my PHP application, which processes more than ten thousand lines of CSV data. Basically I have around ten functions consisting of different preg_match / preg_replace actions, each processing one column of every line of the parsed CSV data (NLP DateParser, various string modifications, image recognition from HTML source, etc.).
Since I am reaching a point where the script is really slow (between 50 and 120 seconds) and I am also running into memory issues (too complex objects), it's now time for some performance improvements ;)
So I came across pthreads, which allows multithreading within a PHP script. But I am not really sure whether it helps in my situation or just produces more performance issues than it solves (through the overhead of thread handling):
My idea is to loop through all ten thousand lines and start a thread for each column-processing step (10k lines × 10 columns = 100,000 threads). Do you think this will lead to performance improvements? Or should I rather split the CSV data into chunks (let's say 200 rows each) that are processed in separate threads (10k lines / 200 rows per chunk = 50 threads)?
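For reference, the chunked variant I have in mind would look roughly like this (untested sketch; it assumes the pthreads extension on a thread-safe (ZTS) PHP build, and ChunkWorker as well as the chunk size of 200 are only illustrative):

```php
<?php
// Untested sketch of the "chunked" variant: each thread processes 200 rows.
// Requires the pthreads extension on a thread-safe (ZTS) PHP build.
class ChunkWorker extends Thread
{
    private $rows;
    public $result;

    public function __construct(array $rows)
    {
        $this->rows = $rows;
    }

    public function run()
    {
        $processed = array();
        foreach ($this->rows as $row) {
            // apply the ~10 column-processing functions to $row here
            $processed[] = $row;
        }
        $this->result = $processed; // pthreads serializes the array member
    }
}

$rows    = array(); // the ~10k parsed CSV rows
$threads = array();

foreach (array_chunk($rows, 200) as $chunk) {
    $t = new ChunkWorker($chunk);
    $t->start();
    $threads[] = $t;
}

$all = array();
foreach ($threads as $t) {
    $t->join();
    $all = array_merge($all, (array) $t->result);
}
```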
I would have attached an image of my profiled PHP script, where you could see which functions consume the most time, but sadly I do not have enough reputation points :/
Is there any potential in parallel processing here? Can I add attributes to the same object directly from different threads, or do I have to synchronize first (i.e. wait for the threads to complete)? Is it possible to read the same file in multiple threads (first hundred rows in thread 1, second hundred in thread 2, etc.) and build one big object containing all rows at the end of all processing steps?
I hope my bad English does not prevent you from understanding my thoughts and questions...
Thanks in advance!
mfuesslin
EDIT: I am not sure about the bottlenecks: my guess is that the biggest one is the handling of the big object resulting from processing all the CSV data... The profiler pointed me to some redundant foreach loops, which I was able to omit. But the main problem is the amount of data I have to process. None of the processing functions takes much time on its own (but if you run them 10k times in a row...).
The idea of working with an in-memory DB instead of the CSV is nice - I will try that out.
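Something along these lines is what I would try first (untested sketch; 'data.csv', the table name and the three columns are only placeholders, and it assumes the pdo_sqlite extension is available):

```php
<?php
// Untested sketch: load the CSV into an in-memory SQLite table via PDO.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE rows (col1 TEXT, col2 TEXT, col3 TEXT)');

$insert = $db->prepare('INSERT INTO rows (col1, col2, col3) VALUES (?, ?, ?)');

$db->beginTransaction();
$fh = fopen('data.csv', 'r');
while (($line = fgetcsv($fh)) !== false) {
    $insert->execute(array_slice($line, 0, 3));
}
fclose($fh);
$db->commit();

// From here on, query the data instead of re-reading the CSV.
foreach ($db->query('SELECT col1, col2, col3 FROM rows') as $row) {
    // run the column-processing functions on $row here
}
```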
The preg_* functions cannot be replaced by str_* functions because I need pattern recognition.
I will also try Gearman and separate each data-processing step into individual jobs.
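Roughly what I have in mind for the Gearman approach (untested sketch; 'process_dates' is an invented job name, the worker and client would live in separate scripts, and it assumes the pecl/gearman extension plus a gearmand on 127.0.0.1:4730):

```php
<?php
// Untested sketch: one processing step as a Gearman job.

// --- worker script -------------------------------------------------------
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('process_dates', function (GearmanJob $job) {
    $rows = json_decode($job->workload(), true);
    // run one processing step (e.g. the NLP date parsing) on each row here
    return json_encode($rows);
});
while ($worker->work());

// --- client script -------------------------------------------------------
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$chunk  = array(/* a subset of the parsed CSV rows */);
$result = json_decode($client->doNormal('process_dates', json_encode($chunk)), true);
```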
PHP Version is 5.6.10 with opcache enabled.
Upvotes: 2
Views: 540
Reputation: 41796
Sounds like you want to pull out a really big gun. I'm not sure that pthreads will solve all your problems. I won't go into detail on how to apply pthreads, because there is a lot going on here and it seems there is still room for improvement in the existing solution.
preg_*() functions: try to replace them with string functions

Upvotes: 2