Reputation: 1015
I am writing a program that loads data from four XML files into four different data structures. It has methods like this:
require 'rexml/document'

def loadFirst(year)
  File.open("games_#{year}.xml", 'r') do |f|
    doc = REXML::Document.new f
    ...
  end
end

def loadSecond(year)
  File.open("teams_#{year}.xml", 'r') do |f|
    doc = REXML::Document.new f
    ...
  end
end
etc...
I originally just used one thread and loaded one file after another:
def loadData(year)
  time = Time.now
  loadFirst(year)
  loadSecond(year)
  loadThird(year)
  loadFourth(year)
  puts Time.now - time
end
Then I realized that I should be using multiple threads. My expectation was that loading from each file on a separate thread would be very close to four times as fast as doing it all sequentially (I have a MacBook Pro with an i7 processor):
def loadData(year)
  time = Time.now
  t1 = Thread.start { loadFirst(year) }
  t2 = Thread.start { loadSecond(year) }
  t3 = Thread.start { loadThird(year) }
  loadFourth(year)
  t1.join
  t2.join
  t3.join
  puts Time.now - time
end
What I found was that the version using multiple threads is actually slower than the sequential one. How can this possibly be? The threaded version is around 20 seconds slower, with each version taking around 2 to 3 minutes.
There are no shared resources between the threads. Each opens a different data file and loads data into a different data structure than the others.
Upvotes: 5
Views: 1839
Reputation: 329
I think (but I'm not sure) the problem is that all of your threads are reading files stored on the same disk, so they can't really run simultaneously: each one spends its time waiting on disk I/O.
A few days ago I had to do something similar (but fetching data from the network), and there the difference between the sequential and the threaded version was huge.
A possible solution is to load the whole content of each file up front instead of loading it the way your code does. Your code reads the content line by line while it is being processed. If you load all of the content first and then process it, you should get much better performance (because the threads should no longer be waiting for I/O).
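As a rough sketch of what I mean by reading everything first and parsing afterwards (the helper name load_document is made up for illustration, and I haven't measured whether this actually helps in your case):

```ruby
require 'rexml/document'

# Hypothetical helper (not from the question): read the whole file into
# memory in one call, then hand the in-memory string to REXML.
def load_document(path)
  xml = File.read(path)      # single bulk read from disk
  REXML::Document.new(xml)   # parsing now works on a String, not an open file
end
```

Inside loadFirst you would then write something like doc = load_document("games_#{year}.xml") in place of the File.open block.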
Upvotes: 3
Reputation: 30465
It's impossible to give a conclusive answer to why your parallel program is slower than the sequential one without a lot more information, but one possibility is:
With the sequential program, your disk seeks to the first file, reads it all out, seeks to the 2nd file, reads it all out, and so on.
With the parallel program, the disk head keeps moving back and forth trying to service I/O requests from all 4 threads.
I don't know if there's any way to measure disk seek time on your system: if so, you could confirm whether this hypothesis is true.
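This does not measure seek time directly, but one rough way to see whether disk reads matter at all is to time the read and the parse separately for each file. A minimal sketch, assuming the standard Benchmark module and a helper name (time_read_and_parse) that I made up:

```ruby
require 'benchmark'
require 'rexml/document'

# Hypothetical helper: returns the wall-clock seconds spent reading the file
# from disk and the seconds spent parsing it, as two separate numbers.
def time_read_and_parse(path)
  xml = nil
  read_seconds  = Benchmark.realtime { xml = File.read(path) }
  parse_seconds = Benchmark.realtime { REXML::Document.new(xml) }
  { read: read_seconds, parse: parse_seconds }
end
```

If the read time is tiny compared with the parse time, disk seeks are unlikely to explain the slowdown. Keep in mind that the operating system caches recently read files, so a second run may report a much smaller read time than the first.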
Upvotes: 0