Reputation: 2363
I have a lot of text documents on the one hand and a huge list of keywords (strings) on the other. Now I'm interested in which of these keywords are contained in the documents.
At the moment I'm using a monstrous auto-generated regex:
keywords = %w(Key1 Key2 Key3)
regx = Regexp.new('\b(' + keywords.map { |k| Regexp.escape(k) }.join('|') + ')\b', Regexp::IGNORECASE)
documents.each do |d|
  d.scan(regx)
end
This worked great for a list of a few hundred keywords, but now I'm using about 50,000 keywords and it's slowing down too much.
Is there a better way to do such an operation in Ruby?
Upvotes: 2
Views: 533
Reputation: 323
I would start by using the phrasie gem. It gives you an array of the words in each document, which you can then easily match against your keywords.
Have a look: https://github.com/ashleyw/phrasie
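I can't vouch for phrasie's exact API from memory, so here is a minimal sketch of the matching idea in plain Ruby (the keyword list and document text are made-up placeholders); phrasie would supply the word array in place of the naive `scan` below:

```ruby
require "set"

# Hypothetical keyword list, stored in a Set for O(1) membership tests.
keywords = Set.new(%w[ruby regex performance].map(&:downcase))

# Placeholder document text.
document = "Ruby regex scans can hurt performance on large inputs."

# Tokenize the document into words, normalize case, and keep only
# the unique words that appear in the keyword set.
found = document.scan(/[\w']+/).map(&:downcase).uniq
               .select { |w| keywords.include?(w) }
```

With this shape, the cost per document is linear in the document length, independent of how many keywords you have.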
Upvotes: 0
Reputation: 168071
Convert the list of keywords to a hash:
h = {
  "foo" => true,
  "bar" => true,
  ...
  "baz" => true,
}
Then, read the document chunk by chunk (separated by spaces):
File.new("/path/to/file").each(" ") do |ws|
  ws.scan(/[\w']+/) do |w|
    if h.key?(w)
      # Found.
    end
  end
end
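To make the approach concrete, a minimal self-contained sketch on an in-memory string rather than a file (the keyword hash and text are placeholders); each `h.key?` lookup is O(1), so the work no longer grows with the size of the keyword list:

```ruby
# Hypothetical keyword hash (values are just placeholders for presence).
h = { "foo" => true, "bar" => true, "baz" => true }

# Placeholder document text.
document = "foo went to the bar, skipped the pub, then foo found baz."

# Count how often each keyword occurs, using a constant-time hash lookup
# per scanned word instead of one giant alternation regex.
found = Hash.new(0)
document.scan(/[\w']+/) do |w|
  found[w] += 1 if h.key?(w)
end
```

The regex here only splits out word tokens; all the keyword matching happens in the hash, which is what makes this scale to 50,000 keywords.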
Upvotes: 1