PascalTurbo

Reputation: 2363

Match a huge list of keywords against a string using Ruby

I have a large number of text documents on the one hand and a huge list of keywords (strings) on the other. I want to find out which of these keywords occur in the documents.

At the moment I'm using a monstrous auto-generated regex:

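keywords = %w(Key1 Key2 Key3)  # %w splits on whitespace, so no commas
# Escape each keyword so regex metacharacters are matched literally.
regx = Regexp.new('\b(' + keywords.map { |k| Regexp.escape(k) }.join('|') + ')\b', 'i')
documents.each do |d|
  d.scan(regx)
end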

This worked well for a list of a few hundred keywords, but now I have about 50000 keywords and it has become too slow.

Is there a better way to do this in Ruby?

Upvotes: 2

Views: 533

Answers (2)

Alphons

Reputation: 323

I would start with the phrasie gem. It gives you an array of the words in each document, which you can then easily match against your keywords.

Have a look: https://github.com/ashleyw/phrasie
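
Once a document has been reduced to an array of words, the matching step itself can be an array intersection. A minimal sketch, assuming single-word keywords; the plain `scan` below is only a stand-in tokenizer (it is not phrasie's API), and the sample data is made up:

```ruby
# Hypothetical sample data.
keywords = %w[ruby regex hash]
document = "A Ruby hash lookup is cheap. A hash is enough."

# Stand-in tokenizer: lowercase the text and pull out the words.
words = document.downcase.scan(/[\w']+/)

# Array#& returns the unique elements common to both arrays.
matches = keywords & words
p matches
```

Lowercasing both sides mirrors the case-insensitive flag of the regex in the question.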

Upvotes: 0

sawa

Reputation: 168071

Convert the list of keywords to a hash:

h = {
  "foo" => true,
  "bar" => true,
  ...
  "baz" => true,
}

Then read the document chunk by chunk (split on spaces) and look up each word in the hash:

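# Read the file in space-separated chunks instead of loading it whole.
File.foreach("/path/to/file", " ") do |chunk|
  # Extract the words (letters, digits, underscores, apostrophes) of the chunk.
  chunk.scan(/[\w']+/) do |w|
    if h.key?(w)
      # Found.
    end
  end
end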
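
For reference, a self-contained sketch of the same idea, assuming the keywords are single words and that matching should be case-insensitive like the regex in the question. The names `KEYWORDS` and `found_keywords` are hypothetical, and a `Set` stands in for the hash of `true` values:

```ruby
require "set"

# Hypothetical keyword list; in practice this would hold the ~50000 entries.
KEYWORDS = Set.new(%w[foo bar baz].map(&:downcase))

# Returns the set of keywords that occur as whole words in the given text.
def found_keywords(text, keywords = KEYWORDS)
  found = Set.new
  text.scan(/[\w']+/) do |word|
    key = word.downcase            # normalize case before the lookup
    found << key if keywords.include?(key)
  end
  found
end

p found_keywords("Foo met Bar; nothing else matched.").to_a
```

Each word costs one hash lookup, so the running time should grow with the length of the document rather than with the number of keywords. Keywords that span several words would need a different approach.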

Upvotes: 1
