Cjoerg

Reputation: 1325

How to speed up scanning a big string with a big regex?

I have a big string:

string = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec nec neque..."
puts string.size # => 54555999

I also have a big regex:

regex = /Lorem|ipsum|dolor|sit|amet|consectetur|adipiscing|elit|Donec|nec|neque|facilisis|nulla|rhoncus|accumsan|non|in|arcu|Interdum|et|malesuada|fames|ac|ante|primis|faucibus|Pellentesque|habitant|morbi|tristique|senectus|netus|turpis|egestas|at|ut|metus|convallis|fringilla|Nullam|volutpat|risus|sodales|elementum|Fusce|vitae|dignissim|tortor|Vivamus|interdum|dapibus|leo|sed|Quisque|luctus|dui|ligula|consequat|augue|congue|a|Vestibulum|id|cursus|odio|Maecenas|libero|diam|placerat|Proin|sapien|gravida|Cras|eleifend|nisl|rutrum|lectus|Curabitur|auctor|urna|tellus|tincidunt|erat|eget|vulputate|nibh|tempor|Ut|vehicula|nisi|velit|suscipit|nunc|tempus|vestibulum|viverra|Duis|sagittis|dictum|justo|hendrerit|massa|mollis|ultricies|lorem|imperdiet|mattis|pharetra|Aenean|lacus|purus|condimentum|Integer|sem|ullamcorper|feugiat|venenatis|quis|pellentesque|felis|finibus|porta|Nam|pulvinar|est|Morbi|ex|eros|commodo|Praesent|mauris|scelerisque|enim|aliquet|Etiam|Mauris|eu|bibendum|efficitur|magna|maximus|ornare|Phasellus|vel|blandit|sollicitudin|Suspendisse|Sed|quam|pretium|mi|semper|molestie|In|Nulla|Aliquam|euismod|orci|varius|hac|habitasse|platea|dictumst|iaculis|ultrices|Nunc|aliquam|fermentum|lacinia|lobortis|porttitor|laoreet|posuere|cubilia|Curae|facilisi|potenti|Cum|sociis|natoque|penatibus|magnis|dis|parturient|montes|nascetur|ridiculus|mus|Class|aptent|taciti|sociosqu|ad|litora|torquent|per|conubia|nostra|inceptos|himenaeos/

I now wish to scan the string with the regex:

puts Benchmark.measure { string.scan(regex) } # => 17.830000   0.380000  18.210000 ( 18.235952)

As you can see, scanning takes a full 17.83 seconds, which is too long for my use case.

How can I speed this up?

The match and =~ methods won't do me any good, because I need the array of matches. I think that maybe I need to go outside Ruby to speed this up, but I have no idea how.

The real-world regex is a long alternation of order IDs, and the real-world string is many emails (subject and content) merged into one string. The point of scanning the emails for mentions of the order IDs is to find out which order IDs customers have enquired about.
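For concreteness, a tiny self-contained version of this setup (the order IDs and emails here are made up) looks like:

```ruby
# Hypothetical stand-ins for the real data
order_ids = ["AB123", "CD456", "EF789"]
emails    = ["Where is my order AB123?", "Invoice for EF789 attached"]

# All emails merged into one big string
string = emails.join("\n")

# One big alternation of all the order IDs
regex = /AB123|CD456|EF789/

matches = string.scan(regex) #=> ["AB123", "EF789"]
```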

EDIT:

The reason I would like to make an initial regex test is that afterwards I need to run a very big loop like this:

["order_id_1", "order_id_2"].each do |order_id|
  ["email body 1", "email body 2"].delete_if do |email_string|
    match = (email_string =~ Regexp.new(order_id)) != nil
    if match
      # do stuff
    end
    match
  end
end

And with many order IDs and many emails this becomes a very big loop, unless I initially assess which order IDs have been corresponded about.
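One way to sketch that pre-filter (order IDs and emails hypothetical): check the merged string once with cheap substring tests, then run the expensive per-email loop only for the IDs that are mentioned at all:

```ruby
# Hypothetical order IDs and emails
order_ids = ["AB123", "CD456", "EF789"]
emails    = ["Where is my order AB123?", "Invoice for EF789 attached"]

merged = emails.join("\n")

# One cheap pass to find which IDs are mentioned anywhere at all
mentioned_ids = order_ids.select { |id| merged.include?(id) }
#=> ["AB123", "EF789"]

# The expensive per-email loop now only runs for mentioned IDs
mentioned_ids.each do |order_id|
  emails.delete_if do |email|
    if email.include?(order_id)
      # do stuff with this email / order ID pair
      true
    else
      false
    end
  end
end
```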

Upvotes: 3

Views: 1430

Answers (3)

Cary Swoveland

Reputation: 110685

Suppose order_ids is an array of order ids, which I assume contain letters and/or digits, and that case is not an issue. If email is a string containing one email,

email.scan(/\w+/) & order_ids

will give you all the order ids contained in that email. You can speed that up, however, by first converting order_ids to a set:

require 'set'
order_id_set = order_ids.to_set

email.scan(/\w+/).select { |w| order_id_set.include?(w) }

In any event, there is no advantage to stringing the emails together.
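Putting those pieces together per email (sample data hypothetical), this might look like:

```ruby
require 'set'

order_id_set = %w[AB123 CD456 EF789].to_set
emails = ["Where is my order AB123?", "Any news on EF789 and AB123?"]

# For each email, keep only the words that are known order ids
matches_by_email = emails.map do |email|
  email.scan(/\w+/).select { |w| order_id_set.include?(w) }
end
#=> [["AB123"], ["EF789", "AB123"]]
```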

I suggested you divide the email into words using:

email.scan(/\w+/)

but you may want something more elaborate. Suppose an email were as follows:

email = "I am really annoyed that my order AB123 was shipped late, " +
  "that order CD456was poorly packed and I was overcharged for order EF789."

Let's look at two ways we might divide this into words:

esplit = email.split
  #=> ["I", "am", "really", "annoyed", "that", "my", "order", "AB123",
  #    "was", "shipped", "late,", "that", "order", "CD456was", "poorly",
  #    "packed", "and", "I", "was", "overcharged", "for", "order", "EF789."]

escan = email.scan(/\w+/)
  #=> ["I", "am", "really", "annoyed", "that", "my", "order", "AB123",
  #    "was", "shipped", "late", "that", "order", "CD456was", "poorly",
  #    "packed", "and", "I", "was", "overcharged", "for", "order", "EF789"]

You can see trouble is brewing (e.g., "CD456was") with both of these approaches. Now let's create a set of order ids:

require 'set'
order_id_set = %w{ AB123 CD456 EF789 GH012 }.to_set
  #=> #<Set: {"AB123", "CD456", "EF789", "GH012"}>

and then search the email for order ids:

esplit.select { |w| order_id_set.include?(w) }
  #=> ["AB123"]
escan.select  { |w| order_id_set.include?(w) }
  #=> ["AB123", "EF789"]

As you can see, split (the same as split(' ') and split(/\s+/)), which splits the line on whitespace, caught the first order id but missed the other two: CD456, because the writer failed to put a space between that id and the word "was", and EF789, because split was looking for "EF789.".

scan(/\w+/), which extracts "words" (\w matches a-z, A-Z, 0-9 and _), did better, identifying "EF789" as an order id, because the period is a non-word character. It still missed "CD456", however: since "w", "a" and "s" are also word characters, the scan returned the single token "CD456was". (This is similar to your "1234" example.)

What this means is that you need to craft a better regex. If, for example, all the order ids were like the ones in my example--two capital letters followed by three digits--you might write:

email.scan(/(?<![A-Z])[A-Z]{2}\d{3}(?!\d)/)
     .select { |w| order_id_set.include?(w) }
  #=> ["AB123", "CD456", "EF789"]

This matches two capital letters immediately preceded by anything other than a capital letter, followed by three digits, immediately followed by any character other than another digit. This was just an illustration of what is possible. You might want to post another question that is concerned exclusively with using regular expressions to identify possible order numbers in the emails. For that you would have to better describe what the order numbers look like, of course.

In any event, your regex would not contain #{order_id}, which you mentioned. The only way of doing that would be to check each email with a separate regex for each order id, which would be extraordinarily inefficient. Instead, you want to use the regex to identify possible order numbers and then check them against the set of order numbers. In choosing a regex you might have a tradeoff between the number of order ids that are missed and the number of non-order id strings that you waste time checking against the set of order ids.
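To make that tradeoff concrete (data hypothetical): a format-aware regex can recover an id with no surrounding whitespace, while a generic word scan leaves it buried inside a longer token that fails the set check:

```ruby
require 'set'

order_id_set = %w[AB123 CD456 EF789].to_set
email = "Re: order AB123; also orderCD456was late, see EF789."

# Format-aware: two capitals + three digits, not embedded in a longer id
format_aware = email.scan(/(?<![A-Z])[A-Z]{2}\d{3}(?!\d)/)
#=> ["AB123", "CD456", "EF789"]

# Generic word scan: "orderCD456was" hides CD456 inside one token
generic = email.scan(/\w+/).select { |w| order_id_set.include?(w) }
#=> ["AB123", "EF789"]
```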

Upvotes: 3

Kevin

Reputation: 15974

Here is an algorithm that runs in O(n^2) time. This is much better than the performance of your regex, which is likely much worse than O(n^2).

The intuition behind this algorithm is that using strings and regexes is computationally expensive! Regexes are powerful tools when run over small strings or when run infrequently, but perform poorly when both the regex and string are complicated. However, in your particular case, you're not really interested in whether a string is formatted in a particular way. Instead, you want to find out which emails match which order IDs. We can do this directly, without the overhead of using regexes.

According to one of your comments, the data you receive isn't initially in the form of a long string and a long regex, but rather in two large lists. This is good, as using the lists makes it easier to use the algorithm. Since you're interested in knowing which order IDs match emails, we'll design our algorithm so that it produces an array containing every order ID that matched an email.

This is the algorithm:

  1. Create an array of matched order IDs. Initially, it will be empty.
  2. For each string in the array of emails,
    1. For each order ID in the array of order IDs, see if any of them match the email.
    2. If we find a match, add the order ID to the array of matched order IDs.
  3. The array of matched order IDs contains every order ID that matches an email.

If there are n emails and m order IDs, this algorithm takes O(nm) time. If n and m are about the same, then the algorithm runs in O(n^2) time.


This is some Ruby code implementing this algorithm:

emails    = ['email1', 'email2', 'email3'] # etc.
order_ids = ['order1', 'order2', 'order3'] # etc.

# Create an array to hold the matched order IDs
matched_order_ids = []

# Perform a search for each email
emails.each do |email|
  order_ids.delete_if do |order_id|
    if email.include? order_id
      matched_order_ids.push order_id

      # returning true tells delete_if to remove the order_id,
      # saving us time searching for it in future emails
      true
    else
      # returning false tells delete_if to keep the order_id
      false
    end
  end
end

# At this point, matched_order_ids contains every
# order ID that was matched in at least one of the
# emails
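Running the algorithm above on a small hypothetical data set illustrates the result:

```ruby
emails    = ["Where is AB123?", "EF789 arrived damaged", "unrelated email"]
order_ids = ["AB123", "CD456", "EF789"]

matched_order_ids = []
emails.each do |email|
  order_ids.delete_if do |order_id|
    found = email.include?(order_id)
    matched_order_ids << order_id if found
    found
  end
end

matched_order_ids #=> ["AB123", "EF789"]
order_ids         #=> ["CD456"]  (unmatched IDs remain)
```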

Upvotes: 0

Adrian

Reputation: 15171

Don't use a regular expression. Instead, use a specialized algorithm suited to the specific string and pattern you want to match. If you absolutely need to use a regular expression, you might try a different engine than the one Ruby uses natively (although I doubt that this will be very fruitful). If you still need better performance, write the algorithm in a faster, lower-level language like C, and then call it from Ruby with a native extension or something.
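Short of leaving Ruby, one low-effort thing worth trying (no guarantee it helps with Ruby's engine) is letting Regexp.union build the alternation; it also escapes any special characters in the IDs, which a hand-built alternation can get wrong:

```ruby
# Hypothetical order IDs; note the dot in the second one
order_ids = ["AB123", "CD.456", "EF789"]

regex = Regexp.union(order_ids)
#=> /AB123|CD\.456|EF789/

"order CD.456 shipped".scan(regex) #=> ["CD.456"]
```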


As for crafting a specialized algorithm, I probably can't help you there.

Upvotes: 1
