bytebiscuit

Reputation: 3496

Iterating through a table in Ruby using a hash runs slowly

I have the following code:

h2.each { |k, v|
  @count += 1
  puts @count
  sq.each do |word|
    if Wordsdoc.find_by_docid(k).tf.include?(word)
      sum += Wordsdoc.find_by_docid(k).tf[word] * @s[word]
    end
  end
  rec_hash[k] = sum
  sum = 0
}

h2 -> a hash containing document ids; it has more than 1000 entries.
Wordsdoc -> a model/table in my database.
sq -> a hash containing around 10 words.

What I'm doing is going through each of the document ids, and for each word in sq I look up in the Wordsdoc table whether the word exists (Wordsdoc.find_by_docid(k).tf.include?(word); here tf is a hash of {word => value}).

If it does, I get the value of that word from Wordsdoc and multiply it by the value of the word in @s, which is also a hash of {word => value}.

This seems to be running very slowly; it processes one document per second. Is there a way to process this faster?

Thanks, I really appreciate your help on this!

Upvotes: 1

Views: 285

Answers (3)

luacassus

Reputation: 6720

You're calling Wordsdoc.find_by_docid(k) twice.

You could refactor the code to:

wordsdoc = Wordsdoc.find_by_docid(k)
if wordsdoc.tf.include?(word)
  sum += wordsdoc.tf[word] * @s[word]
end

...but it will still be ugly and inefficient.

You should prefetch all records in batches; see: https://makandracards.com/makandra/1181-use-find_in_batches-to-process-many-records-without-tearing-down-the-server

For example, something like this should be much more efficient:

Wordsdoc.find_in_batches(:conditions => { :docid => array_of_doc_ids }) do |batch|
  batch.each do |wordsdoc|
    sq.each do |word|
      if wordsdoc.tf.include?(word)
        sum += wordsdoc.tf[word] * @s[word]
      end
    end
  end
end

You can also retrieve only certain columns from the Wordsdoc table, for example by passing :select => :tf to find_in_batches.
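
For illustration, here is a rough sketch of combining :conditions and :select (array_of_doc_ids is a stand-in for whatever list of ids you build from h2's keys; whether find_in_batches accepts these finder options directly depends on your Rails version):

Wordsdoc.find_in_batches(:select => "docid, tf",
                         :conditions => { :docid => array_of_doc_ids }) do |batch|
  batch.each do |wordsdoc|
    # only the docid and tf columns are loaded for each record here
  end
end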

Upvotes: 1

pjammer

Reputation: 9577

As you have a lot going on, I'm just going to offer you a few things to check out.

  1. A book called Eloquent Ruby deals with documents and with iterating through documents to count the number of times a word was used. All of its examples are about a document system the author was maintaining, so it could even tackle other problems for you.
  2. inject is a method that could speed up the sum part of what you're doing, maybe (see the sketch after this list).
  3. Delayed Job the whole thing if you can run it asynchronously. If this is a web app, you must be timing out while waiting 1000 seconds for this job to complete before it shows its answer on the screen.
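
For what the inject suggestion might look like, here is a rough sketch (not the asker's exact code; wordsdoc is assumed to be the record fetched once per document id, as suggested in the other answers):

sum = sq.inject(0) do |acc, word|
  wordsdoc.tf.include?(word) ? acc + wordsdoc.tf[word] * @s[word] : acc
end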

Go get em.

Upvotes: 0

Idan Arye

Reputation: 12603

You do a lot of duplicate querying. While ActiveRecord can do some caching in the background to speed things up, there is a limit to what it can do, and there is no reason to make things harder for it.

The most obvious cause of the slowdown is Wordsdoc.find_by_docid(k). For each value of k you call it 10 times, and each of those calls may trigger a second call inside the if body. That means you call that method with the same argument 10-20 times for each entry in h2. Queries to the database are expensive, since the database lives on the hard disk, and accessing the hard disk is expensive in any system. You can just as easily call Wordsdoc.find_by_docid(k) once, before you enter the sq.each loop, and store the result in a variable - that would save a lot of querying and make your loop go much faster.

Another optimization - though not nearly as important as the first one - is to get all the Wordsdoc records in a single query. Almost all mid to high level (and some low level, too!) programming languages and libraries work better and faster when they work in bulk, and ActiveRecord is no exception. If you query for all the Wordsdoc entries at once, filtered by the docids in h2's keys, you can turn 1,000 queries (after the first optimization; before it, it was 10,000-20,000 queries) into a single, big query. That will let ActiveRecord and the underlying database retrieve your data in bigger chunks and save you a lot of disk access.
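
As a rough illustration of that idea (assuming a Rails 3-style where and ActiveSupport's index_by; not the asker's exact schema), you could fetch everything up front and turn the inner lookups into plain hash access:

# one query for all matching documents, indexed by docid for fast lookup
docs_by_id = Wordsdoc.where(:docid => h2.keys).index_by(&:docid)

rec_hash = {}
h2.each_key do |k|
  doc = docs_by_id[k]
  next unless doc
  rec_hash[k] = sq.inject(0) do |sum, word|
    doc.tf.include?(word) ? sum + doc.tf[word] * @s[word] : sum
  end
end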

There are some more minor optimizations you can do, but the two I've described should be more than enough.

Upvotes: 2
