koseduhemak

Reputation: 551

PHP + pthreads: Build one big object - parallel-processing

Recently I experienced some serious performance issues within my PHP application, which tries to do some stuff with more than ten thousand lines of CSV data. Basically I have around ten functions which consist of different preg_match / preg_replace actions that each process one column of each line of the parsed CSV data (NLP DateParser, various string-modifying things, image recognition from HTML source, etc.).

Because I am reaching the point where processing of the script is really slow (between 50 and 120 seconds) and I am also running into memory issues (overly complex objects), it's now time for some performance improvements ;)

So I came across pthreads, which allows multi-threading within a PHP script. But I am not really sure if this helps in my situation or just produces more performance issues than it solves (through the overhead of thread handling):

My idea is to loop through all ten thousand lines and start a thread for each column-processing step (10k lines × 10 columns = 100,000 threads). Do you think that this will lead to performance improvements? Or should I rather split the CSV data into chunks (let's say 200 rows each) which will be processed in separate threads (10k lines / 200 rows per chunk = 50 threads)?
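To illustrate the chunked approach I have in mind, here is a rough sketch (it assumes a ZTS build of PHP with the pthreads extension; ChunkWorker, data.csv and the trivial per-row processing are just placeholders for my real code):

```php
<?php
// Rough sketch of the chunk-per-thread idea (requires pthreads on a ZTS PHP build).
// ChunkWorker and data.csv are placeholders; the real column-processing
// functions would run inside run().

class ChunkWorker extends Thread
{
    private $rows;
    public  $result;

    public function __construct(array $rows)
    {
        $this->rows = $rows;
    }

    public function run()
    {
        $processed = array();
        foreach ($this->rows as $row) {
            // stand-in for the ten preg_match / preg_replace column steps
            $processed[] = array_map('trim', $row);
        }
        // store as a string to sidestep pthreads' copy-on-assign handling of arrays
        $this->result = serialize($processed);
    }
}

$rows = array_map('str_getcsv', file('data.csv'));

$workers = array();
foreach (array_chunk($rows, 200) as $chunk) {   // ~50 threads for 10k rows
    $worker = new ChunkWorker($chunk);
    $worker->start();
    $workers[] = $worker;
}

$all = array();
foreach ($workers as $worker) {
    $worker->join();                            // wait for each thread, then merge
    $all = array_merge($all, unserialize($worker->result));
}
```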

I would have attached an image showing my profiled PHP script, where you can see which functions consume the most time, but I sadly do not have enough reputation points :/

Is there any potential in parallel processing? Can I add attributes to the same object directly from different threads, or do I have to synchronize first (and therefore wait for the threads to complete)? Is it possible to read the same file in multiple threads (first hundred rows in thread 1, second hundred in thread 2, etc.) and build one big object containing all rows at the end of all processing steps?
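To make the second question concrete, this is the kind of shared-object write I mean (rough sketch only; it assumes pthreads' Threaded base class and its synchronized() method, and SharedResults / ColumnThread are made-up names):

```php
<?php
// Sketch: several threads writing into one shared Threaded object.
// SharedResults and ColumnThread are illustrative placeholders.

class SharedResults extends Threaded
{
    public function add($key, $value)
    {
        // synchronized() serializes access when an update must be atomic
        $this->synchronized(function () use ($key, $value) {
            $this[$key] = $value;
        });
    }
}

class ColumnThread extends Thread
{
    private $shared;
    private $key;
    private $value;

    public function __construct(SharedResults $shared, $key, $value)
    {
        $this->shared = $shared;
        $this->key    = $key;
        $this->value  = $value;
    }

    public function run()
    {
        // stand-in for one column-processing step
        $this->shared->add($this->key, strtoupper($this->value));
    }
}

$shared  = new SharedResults();
$threads = array();
foreach (array('a' => 'foo', 'b' => 'bar') as $key => $value) {
    $threads[] = $t = new ColumnThread($shared, $key, $value);
    $t->start();
}
foreach ($threads as $t) {
    $t->join();   // wait before reading the merged results
}
```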

I hope my bad English does not prevent you from understanding my thoughts and questions...

Thanks in advance!

mfuesslin

EDIT: I am not sure about the bottlenecks: my guess is that the biggest one is the handling of the big object resulting from processing all the CSV data... The profiler drew my attention to some redundant foreach loops which I was able to omit. But the main problem is the amount of data I have to process. The individual processing functions do not need that much time (but if you run them on 10k rows in a row...).

The idea of operating with an in-memory DB instead of CSV is nice - I will try that out.
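Something along these lines is what I would try (rough sketch; the table layout, column names and data.csv are just placeholders, and it assumes PDO with the SQLite driver is available):

```php
<?php
// Sketch: load the CSV into an SQLite in-memory database via PDO, so later
// processing steps can query rows in batches instead of holding one big object.

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE rows (id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT, col3 TEXT)');

$insert = $db->prepare('INSERT INTO rows (col1, col2, col3) VALUES (?, ?, ?)');

$db->beginTransaction();                       // one transaction keeps the inserts fast
if (($fh = fopen('data.csv', 'r')) !== false) {
    while (($row = fgetcsv($fh)) !== false) {
        $insert->execute(array_slice($row, 0, 3));
    }
    fclose($fh);
}
$db->commit();

// process row by row (or in batches) instead of keeping everything in memory at once
foreach ($db->query('SELECT * FROM rows') as $row) {
    // run the column-processing functions here
}
```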

The preg_* functions cannot be replaced by str_* functions because I need pattern recognition.

I will also try Gearman and try to separate each data-processing step into individual jobs.
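The split into jobs could look roughly like this (sketch only; it assumes the PECL Gearman extension and a running gearmand, and process_chunk is a made-up job name):

```php
<?php
// Sketch: hand chunks of rows to Gearman workers as background jobs.
// Assumes gearmand is running on the default host/port.

// --- client side ---
$client = new GearmanClient();
$client->addServer();                               // localhost:4730 by default

foreach (array_chunk($rows, 200) as $chunk) {
    $client->doBackground('process_chunk', serialize($chunk));
}

// --- worker side (separate process) ---
$worker = new GearmanWorker();
$worker->addServer();
$worker->addFunction('process_chunk', function (GearmanJob $job) {
    $rows = unserialize($job->workload());
    foreach ($rows as $row) {
        // run one data-processing step here and persist the result
    }
});
while ($worker->work());                            // loop forever, handling jobs
```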

PHP Version is 5.6.10 with opcache enabled.

Upvotes: 2

Views: 540

Answers (1)

Jens A. Koch

Reputation: 41796

Sounds like you want to pull out a really big gun. I'm not sure that pthreads will solve all the problems. I will not go into details on how to apply pthreads, because there are a lot of things going on here and it seems that there is still some room for improvement in the existing solution.

  • Where exactly is the bottleneck? Did you profile your code and work on the bottlenecks?
  • CSV
    • maybe you can drop it and import the CSV data into a Db?
    • e.g., what about using a SQLite in-memory Db for processing?
    • are you lowering the memory footprint of your CSV parser by using chunked parsing? (see the sketch after this list)
  • you are using preg_*() functions: try to replace them with string functions
  • split your data-processing functions into clearly defined individual jobs
  • use a job/queue system for processing, like Gearman
  • what about your PHP? upgraded to 5.6? opcache enabled?
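A minimal sketch of what chunked parsing could look like with a generator, so that only one chunk of rows is in memory at a time (file name and chunk size are placeholders):

```php
<?php
// Sketch: a generator that yields CSV rows in chunks of $chunkSize,
// so only one chunk is held in memory at a time.

function csvChunks($path, $chunkSize = 200)
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $chunk = array();
    while (($row = fgetcsv($handle)) !== false) {
        $chunk[] = $row;
        if (count($chunk) === $chunkSize) {
            yield $chunk;          // hand one chunk to the caller, then continue reading
            $chunk = array();
        }
    }
    if ($chunk) {
        yield $chunk;              // remaining rows
    }
    fclose($handle);
}

foreach (csvChunks('data.csv') as $rows) {
    // run the column-processing functions on this chunk only
}
```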

Upvotes: 2
