Threading vs Forking (with explanation of what I want to do)

Question

So, I've reviewed a ton of articles and forums before posting this, but I keep reading conflicting answers. Firstly, OS is not an issue: I can use either Windows or Unix, whichever would be best for my problem. I have a ton of data that I need to use for read-only purposes (not sure why this would matter, but in case it does, the data structure that I'm going to have to go through is an array of arrays of arrays of hashes whose values are also arrays). I'm essentially comparing a "query" to a ton of different "sentences" and computing their relative similarities. From these quantities (several million), I want to take the top x% and do something with them. I need to parallelize this process. There's just no good way for me to decrease the search space: I need to compare over everything to get good results, and it will just take too long without some sort of threading/forking. Again, I've seen many conflicting answers and don't know which one to use.

Any help would be appreciated. Thanks in advance.

EDIT: I don't think the amount of memory usage will be an issue, but I don't know (8 GB RAM)

Schwern · Accepted Answer

Without more detail on your problem, there's not much help that can be given. You want to parallelize a process. Threads and forks in Perl have advantages and disadvantages.

One of the key things that makes Perl threads different from other threads is that data is not shared by default. This makes threads much easier and safer to work with: you don't have to worry about thread safety of libraries or most of your code, just the threaded bit. However, it can be a performance drag and memory hungry, as Perl must put a copy of the interpreter and all loaded modules into each thread.
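To make "not shared by default" concrete, here is a minimal sketch, assuming a perl built with ithreads. The variable names are made up for illustration. Each thread increments both variables, but only the one marked :shared changes in the parent.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my $private = 0;            # every thread gets its own copy of this
my $shared :shared = 0;     # explicitly shared between all threads

my @workers = map {
    threads->create(sub {
        $private++;                      # changes only this thread's copy
        { lock($shared); $shared++; }    # visible to every thread
    });
} 1 .. 4;
$_->join for @workers;

# The parent's $private is untouched; $shared saw all four increments.
print "private=$private shared=$shared\n";   # prints "private=0 shared=4"
```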

When it comes to forking I will only be talking about Unix. Perl emulates fork on Windows using threads; it works, but it can be slow and buggy.

Forking Advantages

Very fast to create a fork
Very robust

Forking Disadvantages

Communicating between the processes can be slow and awkward
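As a rough illustration of that trade-off, here is a sketch using only core Perl on Unix. The sentences, the query and the score() function are hypothetical stand-ins for the real data and similarity measure. The children inherit the read-only data at fork time for free, but every result has to be serialized and sent back through a pipe by hand.

```perl
use strict;
use warnings;

# Hypothetical read-only data; each child sees it as it was at fork time.
my @sentences = ('the cat sat', 'a dog ran', 'the cat ran', 'birds fly');
my $query     = 'the cat';

# Toy similarity: how many query words appear in the sentence.
sub score {
    my ($query, $sentence) = @_;
    my %words = map { $_ => 1 } split ' ', $sentence;
    return scalar grep { $words{$_} } split ' ', $query;
}

my $workers = 2;
my @readers;
for my $w (0 .. $workers - 1) {
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {                     # child: score every $workers-th sentence
        close $reader;
        for (my $i = $w; $i < @sentences; $i += $workers) {
            print {$writer} join("\t", $i, score($query, $sentences[$i])), "\n";
        }
        close $writer;
        exit 0;
    }
    close $writer;                       # parent keeps only the read end
    push @readers, [$pid, $reader];
}

# Parent: collect "index<TAB>score" lines from each child, then reap it.
my %score_for;
for my $r (@readers) {
    my ($pid, $reader) = @$r;
    while (my $line = <$reader>) {
        chomp $line;
        my ($i, $s) = split /\t/, $line;
        $score_for{$i} = $s;
    }
    close $reader;
    waitpid($pid, 0);
}
printf "%d: %s\n", $score_for{$_}, $sentences[$_]
    for sort { $a <=> $b } keys %score_for;
```

The pipe handling is the awkward part: anything more complex than a line of text needs a serialization step on both ends.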

Thread Advantages

Thread coordination and data interchange is fairly easy
Threads are fairly easy to use

Thread Disadvantages

Each thread takes a lot of memory
Threads can be slow to start
Threads can be buggy (better the more recent your perl)
Database connections are not shared across threads

That last one is a bit of a doozy if the documentation is up to date. If you're going to be doing a lot of SQL, don't use threads.

In general, to get good performance out of Perl threads it's best to start a pool of threads and reuse them. Forks can more easily be created, used and discarded.

Really what it comes down to is what fits your way of thinking and your particular problem.

In either case, you're likely going to want something to manage your pool of workers. For forking you're going to want to use Parallel::ForkManager or Child. Child is particularly nice as it has built-in inter-process communication.
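A minimal Parallel::ForkManager sketch might look like the following. It assumes the module is installed from CPAN and is recent enough to pass data back through finish(); the chunks of numbers are a hypothetical stand-in for real work units.

```perl
use strict;
use warnings;
use Parallel::ForkManager;

my @chunks = ([1 .. 5], [6 .. 10], [11 .. 15]);   # hypothetical work units
my %sum_for;

my $pm = Parallel::ForkManager->new(2);   # at most 2 children at a time

# Runs in the parent each time a child exits; $data is the reference
# the child handed to finish(), if any.
$pm->run_on_finish(sub {
    my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data) = @_;
    $sum_for{$ident} = $$data if defined $data;
});

for my $i (0 .. $#chunks) {
    $pm->start($i) and next;      # parent moves on; child falls through
    my $sum = 0;
    $sum += $_ for @{ $chunks[$i] };
    $pm->finish(0, \$sum);        # child exits and returns its result
}
$pm->wait_all_children;

print "chunk $_: $sum_for{$_}\n" for sort { $a <=> $b } keys %sum_for;
```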

For threads you're going to want to use threads::shared, Thread::Queue and read perlthrtut.
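A pool of reusable worker threads fed by a Thread::Queue can be sketched roughly like this. It assumes a perl built with ithreads and Thread::Queue 3.01 or later (for the end method); square() is a hypothetical placeholder for the real work.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $jobs    = Thread::Queue->new;
my $results = Thread::Queue->new;

sub square { return $_[0] ** 2 }   # hypothetical stand-in for real work

# Start a fixed pool once; each worker loops until the job queue is ended.
my @pool = map {
    threads->create(sub {
        while (defined(my $n = $jobs->dequeue)) {
            $results->enqueue("$n:" . square($n));
        }
    });
} 1 .. 3;

$jobs->enqueue(1 .. 10);
$jobs->end;                 # no more jobs; dequeue returns undef once drained
$_->join for @pool;
$results->end;

# Drain the results queue in the parent.
my %square_of;
while (defined(my $item = $results->dequeue)) {
    my ($n, $sq) = split /:/, $item;
    $square_of{$n} = $sq;
}
print "$_ squared is $square_of{$_}\n" for sort { $a <=> $b } keys %square_of;
```

The threads are created once and reused for all ten jobs, which avoids paying the thread start-up cost per job.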

When reading articles about Perl threads, keep in mind they were a bit crap when they were introduced in 5.8.0 in 2002, and only serviceable by 5.10.1. After that they've firmed up considerably. Information and opinions about their efficiency and robustness tend to fall rapidly out of date.
