Duong Bach

Reputation: 155

Ruby: array-of-hashes lookup performance

I currently face this question. For example, I have this array of hashes:

data = [
  {:id => 1,:start_date => "2015-01-02",:end_date => "2015-01-05"},
  {:id => 2,:start_date => "2015-01-06",:end_date => "2015-01-07"},
  {:id => 3,:start_date => "2015-01-10",:end_date => "2015-01-20"}
]

I want to find the exact hash whose start_date/end_date range contains "2015-01-04".
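(A quick sanity check, since the snippets below compare the dates as plain strings: this only works because ISO-8601 date strings sort lexicographically, in the same order as the dates themselves.)

```ruby
# "YYYY-MM-DD" strings compare character by character, which matches
# chronological order, so string <= / >= is safe for ISO-8601 dates.
in_range     = "2015-01-02" <= "2015-01-04" && "2015-01-04" <= "2015-01-05"
out_of_range = "2015-01-06" <= "2015-01-04"
```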

Following the documentation, I found three ways to do this:

1) Use select

finding_hash = data.select {|h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04"}

finding_hash will be an array containing the matching hash. Since I can guarantee that only one hash will ever match the condition, after the select I still have to call finding_hash.first to get the hash I want.

2) Use find

finding_hash = data.find{|h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04"}

This way, finding_hash IS the hash I need.

3) Traditional loop

data.each do |t|
  if t[:start_date] <= "2015-01-04" && t[:end_date] >= "2015-01-04"
    break t
  end
end

So which one is the fastest way to do this? I really need the performance because my data is quite big!

Thank you, and sorry for my bad English!

Upvotes: 4

Views: 311

Answers (4)

pangpang

Reputation: 8821

You can test with a benchmark.

For example:

require 'benchmark'

n = 1000000

data = [
  {:id => 1,:start_date => "2015-01-02",:end_date => "2015-01-05"},
  {:id => 2,:start_date => "2015-01-06",:end_date => "2015-01-07"},
  {:id => 3,:start_date => "2015-01-10",:end_date => "2015-01-20"}
]


Benchmark.bm do |x|
  x.report do
    n.times do
      data.select { |h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04" }
    end
  end

  x.report do
    n.times do
      data.find { |h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04" }
    end
  end

  x.report do
    n.times do
      finding_hash = {}
      data.each do |t|
        if t[:start_date] <= "2015-01-04" && t[:end_date] >= "2015-01-04"
          finding_hash = t
          break
        end
      end
    end
  end
end

Output:

       user     system      total        real
   1.490000   0.020000   1.510000 (  1.533589)
   1.070000   0.010000   1.080000 (  1.096578)
   1.000000   0.010000   1.010000 (  1.011021)

The test results depend on the value of n and on the data size.
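If warm-up and GC noise are a concern, Benchmark.bmbm is worth trying as well (a sketch: labels added here for readability, and a smaller n than above so it runs quickly). It rehearses each block once before the measured pass:

```ruby
require 'benchmark'

data = [
  {:id => 1, :start_date => "2015-01-02", :end_date => "2015-01-05"},
  {:id => 2, :start_date => "2015-01-06", :end_date => "2015-01-07"},
  {:id => 3, :start_date => "2015-01-10", :end_date => "2015-01-20"}
]

n = 100_000 # smaller than above so the rehearsal pass stays quick

# bmbm runs every block twice: a rehearsal pass to warm up and trigger GC,
# then the measured pass, so the reports are less noisy than plain bm.
reports = Benchmark.bmbm do |x|
  x.report("select") { n.times { data.select { |h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04" } } }
  x.report("find")   { n.times { data.find   { |h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04" } } }
end
```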

Upvotes: 2

lx00st

Reputation: 1596

All of those variants are O(n). If your ranges do not overlap, you can use Array#bsearch, which is O(log n). You have to sort your ranges first:

sorted = data.sort_by { |x| x[:start_date] }
sorted.bsearch { |x| ..check if range of `x` includes value.. }
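A runnable version of that sketch, using bsearch's find-any mode: the block returns a positive number to search right of the current element, a negative number to search left, and 0 on a hit. This only works when the ranges do not overlap:

```ruby
data = [
  {:id => 1, :start_date => "2015-01-02", :end_date => "2015-01-05"},
  {:id => 2, :start_date => "2015-01-06", :end_date => "2015-01-07"},
  {:id => 3, :start_date => "2015-01-10", :end_date => "2015-01-20"}
]
target = "2015-01-04"

sorted = data.sort_by { |x| x[:start_date] }

found = sorted.bsearch do |x|
  if target < x[:start_date]
    -1 # target is before this range: search the left half
  elsif target > x[:end_date]
    1  # target is after this range: search the right half
  else
    0  # target falls inside this range
  end
end
# found => {:id=>1, :start_date=>"2015-01-02", :end_date=>"2015-01-05"}
```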

Upvotes: 1

infused

Reputation: 24337

All the methods you've tried are Enumerable methods, but the native Array methods are faster. Try find_index. Even with the separate call to fetch the hash by index, it's still about 20% faster than the next fastest:

index = data.find_index {|h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04"}
x = data[index]
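One caveat to add here (my note, not from the answer above): find_index returns nil when no element matches, and indexing with data[nil] raises a TypeError, so guard the lookup:

```ruby
data = [
  {:id => 1, :start_date => "2015-01-02", :end_date => "2015-01-05"},
  {:id => 2, :start_date => "2015-01-06", :end_date => "2015-01-07"},
  {:id => 3, :start_date => "2015-01-10", :end_date => "2015-01-20"}
]

# "2015-01-08" falls in the gap between the second and third ranges.
index = data.find_index { |h| h[:start_date] <= "2015-01-08" && h[:end_date] >= "2015-01-08" }
x = index && data[index] # nil instead of a TypeError from data[nil]
```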

My benchmarks:

n = 1_000_000

data = [
  {:id => 1,:start_date => "2015-01-02",:end_date => "2015-01-05"},
  {:id => 2,:start_date => "2015-01-06",:end_date => "2015-01-07"},
  {:id => 3,:start_date => "2015-01-10",:end_date => "2015-01-20"}
]

Benchmark.bm do |x|
  x.report 'Enumerable#select' do
    n.times do
      data.select do |h|
        h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04"
      end
    end
  end

  x.report 'Enumerable#detect' do
    n.times do
      data.detect do |h|
        h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04"
      end
    end
  end

  x.report 'Enumerable#each  ' do
    n.times do
      finding_hash = {}
      data.each do |t|
        if (t[:start_date] <= "2015-01-04" && t[:end_date] >= "2015-01-04")
          finding_hash = t
          break t
        end
      end
    end
  end

  x.report 'Array#find_index ' do
    n.times do
       index = data.find_index {|h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04"}
       x = data[index]
    end
  end
end

Results are:

Enumerable#select  1.000000   0.010000   1.010000 (  1.002282)
Enumerable#detect  0.790000   0.000000   0.790000 (  0.797319)
Enumerable#each    0.620000   0.000000   0.620000 (  0.627272)
Array#find_index   0.520000   0.000000   0.520000 (  0.515691)

Upvotes: 2

B Seven

Reputation: 45943

v3 is fastest:

def v1
  @data.select {|h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04"}
end

def v2
  @data.find{|h| h[:start_date] <= "2015-01-04" && h[:end_date] >= "2015-01-04"}
end

def v3
  @data.each do |t|
    if t[:start_date] <= "2015-01-04" && t[:end_date] >= "2015-01-04"
      return t
    end
  end
end

select will always be slowest because it has to iterate over the entire array. I'm not sure why find is slower than v3; it may come down to method-call overhead.

However, find and v3 may perform the same on your data; the results below are not necessarily representative of it.

t = Time.now; 10000.times{ v1 }; Time.now - t
=> 0.014131

t = Time.now; 10000.times{ v2 }; Time.now - t
=> 0.013138

t = Time.now; 10000.times{ v3 }; Time.now - t
=> 0.008799

Running this on the sample data is not the same thing as running it on your real data.

If the real data is too large, you can run it on a subset of the data to get a better answer.

BTW, you can rewrite v3 as:

data.each do |t|
  break t if (t[:start_date] <= "2015-01-04" && t[:end_date] >= "2015-01-04")
end

FWIW, operating on an array is going to be very unwieldy and slow. You may want to save it in a database and run a query. For a large dataset, that would probably be at least 2 orders of magnitude faster.
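Short of a full database, here is a stdlib-only sketch of the same trade-off: pay an indexing cost once up front, then get O(1) lookups. This assumes the ranges are short enough that materializing every covered date is acceptable:

```ruby
require 'date'

data = [
  {:id => 1, :start_date => "2015-01-02", :end_date => "2015-01-05"},
  {:id => 2, :start_date => "2015-01-06", :end_date => "2015-01-07"},
  {:id => 3, :start_date => "2015-01-10", :end_date => "2015-01-20"}
]

# Build the index once: map every date covered by a range to that range's hash.
index = {}
data.each do |h|
  (Date.parse(h[:start_date])..Date.parse(h[:end_date])).each do |d|
    index[d.iso8601] = h
  end
end

index["2015-01-04"] # => {:id=>1, :start_date=>"2015-01-02", :end_date=>"2015-01-05"}
```

Each subsequent lookup is a single hash access instead of a scan, at the cost of memory proportional to the total number of covered days.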

Upvotes: 1
