drew.cuthbert

Reputation: 1015

Ruby performance with multiple threads vs one thread

I am writing a program that loads data from four XML files into four different data structures. It has methods like this:

def loadFirst(year)
  File.open("games_#{year}.xml",'r') do |f|
    doc = REXML::Document.new f
    ...
  end
end
def loadSecond(year)
  File.open("teams_#{year}.xml",'r') do |f|
    doc = REXML::Document.new f
    ...
  end
end

etc...

I originally just used one thread and loaded one file after another:

def loadData(year)
  time = Time.now
  loadFirst(year)
  loadSecond(year)
  loadThird(year)
  loadFourth(year)
  puts Time.now - time
end

Then I realized that I should be using multiple threads. My expectation was that loading from each file on a separate thread would be very close to four times as fast as doing it all sequentially (I have a MacBook Pro with an i7 processor):

def loadData(year)
  time = Time.now
  t1 = Thread.start{loadFirst(year)}
  t2 = Thread.start{loadSecond(year)}
  t3 = Thread.start{loadThird(year)}
  loadFourth(year)
  t1.join
  t2.join
  t3.join
  puts Time.now - time
end

What I found was that the multithreaded version is actually slower than the sequential one. How can this be? The difference is around 20 seconds, with each version taking around 2 to 3 minutes.

There are no shared resources between the threads. Each opens a different data file and loads data into a different data structure than the others.

Upvotes: 5

Views: 1839

Answers (2)

SDp

Reputation: 329

I think (though I'm not sure) the problem is that all of your threads are reading from the same disk, so they can't really run simultaneously: each one spends its time waiting on disk I/O.

A few days ago I had to do something similar (but fetching data over the network), and there the difference between the sequential and the threaded version was huge.

A possible solution is to read each file's entire content into memory first, instead of loading it the way your code does. As written, your code reads the content line by line. If you load all of the content and then process it, you should see much better performance, because the threads should no longer have to wait on I/O.
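A minimal sketch of that idea, assuming REXML is still the parser: read the file in one call with File.read, then hand the resulting string to REXML::Document.new. The helper name load_document and the temporary demo file are hypothetical, used only so the example runs on its own. Whether this actually helps depends on where the time is being spent.

```ruby
require 'rexml/document'
require 'tempfile'

# Hypothetical helper (not from the question): read the whole file into
# memory with one call, then parse the in-memory string.
def load_document(path)
  xml = File.read(path)        # single bulk read from disk
  REXML::Document.new(xml)     # parsing now works from the string only
end

# Self-contained demo using a throwaway file in place of games_<year>.xml.
game_count = nil
Tempfile.create(['games', '.xml']) do |f|
  f.write('<games><game id="1"/><game id="2"/></games>')
  f.flush
  doc = load_document(f.path)
  game_count = doc.root.elements.to_a.size
end
puts game_count
```

In the question's methods, this would amount to replacing the File.open block with a File.read call and passing the string to the parser.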

Upvotes: 3

Alex D

Reputation: 30465

It's impossible to give a conclusive answer as to why your parallel program is slower than the sequential one without a lot more information, but one possibility is:

With the sequential program, your disk seeks to the first file, reads it all out, seeks to the 2nd file, reads it all out, and so on.

With the parallel program, the disk head keeps moving back and forth trying to service I/O requests from all 4 threads.

I don't know whether there's any way to measure disk seek time on your system; if there is, you could use it to confirm whether this hypothesis is true.
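Short of measuring seeks directly, one indirect check is to time the disk read and the XML parse separately for a single file. The sketch below uses Benchmark.realtime from the standard library. The helper name time_read_and_parse and the temporary demo file are hypothetical. If the read portion turns out to be a small fraction of the total, disk contention is probably not the main cause. Note that repeated runs may be served from the operating system's file cache, which makes reads look faster than a cold read would be.

```ruby
require 'benchmark'
require 'rexml/document'
require 'tempfile'

# Hypothetical helper (not from the question): time reading the file
# and parsing its contents as two separate steps.
def time_read_and_parse(path)
  xml = nil
  read_seconds  = Benchmark.realtime { xml = File.read(path) }
  parse_seconds = Benchmark.realtime { REXML::Document.new(xml) }
  { read: read_seconds, parse: parse_seconds }
end

# Self-contained demo on a throwaway file; point it at a real data file
# to get meaningful numbers.
timings = nil
Tempfile.create(['teams', '.xml']) do |f|
  f.write('<teams>' + '<team name="x"/>' * 1000 + '</teams>')
  f.flush
  timings = time_read_and_parse(f.path)
end
puts format('read: %.4fs, parse: %.4fs', timings[:read], timings[:parse])
```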

Upvotes: 0
