Frank Smitten

Reputation: 23

Ruby if ... any? ... include? syntax

I need to check if any elements of a large (60,000+ elements) array are present in a long string of text. My current code looks like this:

if $TARGET_PARTLIST.any? { |target_pn| pdf_content_string.include? target_pn }
  self.last_match_code = target_pn
  self.is_a_match = true
end

I get a syntax error: "undefined local variable or method target_pn".

Could someone let me know the correct syntax to use for this block of code? Also, if anyone knows of a quicker way to do this, I'm all ears!

Upvotes: 0

Views: 465

Answers (3)

Amadan

Reputation: 198334

A way that is probably more performant is to move all of this into native code by letting Regexp do the search.

# needed only once
TARGET_PARTLIST_RE = Regexp.new("\\b(?:#{$TARGET_PARTLIST.sort.map { |pl| Regexp.escape(pl) }.join('|')})\\b")

# to check
self.last_match_code = pdf_content_string[TARGET_PARTLIST_RE]
self.is_a_match = !self.last_match_code.nil?

A much more performant way would be to build a prefix tree and create the regexp using the prefix tree (this optimises the regexp lookup), but this is a bit more work :)
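A minimal sketch of the prefix-tree idea, assuming the part numbers are plain strings and the list is non-empty. The helper names trie_regexp_source and node_to_source and the sample part numbers are made up for illustration; they are not from the question.

```ruby
# Hypothetical helper: builds a nested-hash trie from the words and returns
# a regexp source string in which common prefixes are shared.
def trie_regexp_source(words)
  trie = {}
  words.each do |word|
    node = trie
    word.each_char { |ch| node = (node[ch] ||= {}) }
    node[:end] = true # a complete word ends at this node
  end
  node_to_source(trie)
end

# Hypothetical helper: turns one trie node into regexp source.
def node_to_source(node)
  branches = node.reject { |key, _| key == :end }
                 .map { |ch, child| Regexp.escape(ch) + node_to_source(child) }
  return '' if branches.empty?

  if node[:end]
    # A word may stop here or continue, so the rest is optional.
    "(?:#{branches.join('|')})?"
  elsif branches.size == 1
    branches.first
  else
    "(?:#{branches.join('|')})"
  end
end

part_numbers = %w[AB12 AB13 AB1 CD-9] # made-up sample data
part_re = Regexp.new("\\b#{trie_regexp_source(part_numbers)}\\b")

match = "Order AB13 now"[part_re] # the matched part number, or nil
```

With the sample list, AB1, AB12 and AB13 share the prefix AB1 in the generated pattern instead of appearing as three separate alternatives. Whether this beats the plain alternation for 60,000 entries is something to measure rather than assume.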

Upvotes: 0

Cary Swoveland

Reputation: 110685

You should use Enumerable#find rather than Enumerable#any?.

found = $TARGET_PARTLIST.find { |target_pn| pdf_content_string.include? target_pn }
if found
  self.last_match_code = found
  self.is_a_match = true
end

Note that this does not ensure that the string contains a word that is an element of $TARGET_PARTLIST. For example, if $TARGET_PARTLIST contains the word "able", it will be found in the string "Are you comfortable?". If you only want to match whole words, you could do the following.

found = $TARGET_PARTLIST.find { |target_pn| pdf_content_string[/\b#{target_pn}\b/] }

Note this uses the method String#[].

\b is a word boundary in the regular expression, meaning that the first (last) character of the match cannot be preceded (followed) by a word character (a letter, digit or underscore).
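A small sketch of the difference between a substring match and a whole-word match. The helper name whole_word_match is made up for illustration, and wrapping the word in Regexp.escape is an extra precaution that assumes part numbers may contain characters such as "+" or "." that are special in a regexp.

```ruby
# Hypothetical helper: returns the word if it occurs in text as a whole
# word, otherwise nil.
def whole_word_match(text, word)
  text[/\b#{Regexp.escape(word)}\b/]
end

text = "Are you comfortable?"

text.include?("able")                    # true: plain substring match
whole_word_match(text, "able")           # nil: "able" is inside a longer word
whole_word_match("Are you able?", "able") # "able"
```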

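If speed is important, it may be faster to use the following, because the cheap include? check rules out most elements before a regular expression is built for them.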

found = $TARGET_PARTLIST.find { |target_pn|
  pdf_content_string.include?(target_pn) && pdf_content_string[/\b#{target_pn}\b/] }

Upvotes: 2

Alexis Purslane

Reputation: 1390

In this case, all your syntax is correct; you've just got a scoping error. While target_pn is defined (as a parameter) inside the block passed to any?, it is not defined in the body of the if statement, because the scope of the any? block ends with the closing curly brace and target_pn is not available outside that scope. A correct (and more idiomatic) version of your code would look like this:

self.is_a_match = $TARGET_PARTLIST.any? do |target_pn| 
  included = pdf_content_string.include? target_pn
  self.last_match_code = target_pn if included
  included
end

Alternatively, as jvillian so kindly suggests, one could turn the string into an array of words, then take the intersection with the part list and see if the result is non-empty. Like this:

self.is_a_match = !($TARGET_PARTLIST & 
                    pdf_content_string.gsub(/[^A-Za-z ]/,"")
                                      .split).empty?

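Unfortunately, this approach loses self.last_match_code. Note that the above regex strips everything except ASCII letters and spaces, so it will have to be changed if your part numbers contain digits or punctuation or, as Sergio pointed out, if you're dealing with non-English languages.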
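A hedged sketch of a word-lookup variant that keeps the matched code. It assumes part numbers consist only of letters and digits, so that scanning for alphanumeric runs splits the text correctly; part numbers with hyphens or other punctuation would need a different pattern. The helper name first_listed_word and the sample data are made up for illustration.

```ruby
require 'set'

# Hypothetical helper: returns the first word of text that is in part_set,
# or nil if none is. part_set should be built once and reused.
def first_listed_word(text, part_set)
  text.scan(/[[:alnum:]]+/).find { |word| part_set.include?(word) }
end

part_set = %w[CD9 AB13].to_set # made-up sample data; build once

found = first_listed_word("Order AB13 and CD9 today", part_set)
# found is the matched part number ("AB13" here), or nil when nothing matches
```

Each Set lookup takes roughly constant time, so the cost grows with the number of words in the text rather than with the size of the part list.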

Hope that helps!

Upvotes: 3
