Reputation: 1325
I have big string:
string = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec nec neque..."
puts string.size # => 54555999
I also have a big regex:
regex = /Lorem|ipsum|dolor|sit|amet|consectetur|adipiscing|elit|Donec|nec|neque|facilisis|nulla|rhoncus|accumsan|non|in|arcu|Interdum|et|malesuada|fames|ac|ante|primis|faucibus|Pellentesque|habitant|morbi|tristique|senectus|netus|turpis|egestas|at|ut|metus|convallis|fringilla|Nullam|volutpat|risus|sodales|elementum|Fusce|vitae|dignissim|tortor|Vivamus|interdum|dapibus|leo|sed|Quisque|luctus|dui|ligula|consequat|augue|congue|a|Vestibulum|id|cursus|odio|Maecenas|libero|diam|placerat|Proin|sapien|gravida|Cras|eleifend|nisl|rutrum|lectus|Curabitur|auctor|urna|tellus|tincidunt|erat|eget|vulputate|nibh|tempor|Ut|vehicula|nisi|velit|suscipit|nunc|tempus|vestibulum|viverra|Duis|sagittis|dictum|justo|hendrerit|massa|mollis|ultricies|lorem|imperdiet|mattis|pharetra|Aenean|lacus|purus|condimentum|Integer|sem|ullamcorper|feugiat|venenatis|quis|pellentesque|felis|finibus|porta|Nam|pulvinar|est|Morbi|ex|eros|commodo|Praesent|mauris|scelerisque|enim|aliquet|Etiam|Mauris|eu|bibendum|efficitur|magna|maximus|ornare|Phasellus|vel|blandit|sollicitudin|Suspendisse|Sed|quam|pretium|mi|semper|molestie|In|Nulla|Aliquam|euismod|orci|varius|hac|habitasse|platea|dictumst|iaculis|ultrices|Nunc|aliquam|fermentum|lacinia|lobortis|porttitor|laoreet|posuere|cubilia|Curae|facilisi|potenti|Cum|sociis|natoque|penatibus|magnis|dis|parturient|montes|nascetur|ridiculus|mus|Class|aptent|taciti|sociosqu|ad|litora|torquent|per|conubia|nostra|inceptos|himenaeos/
I now wish to scan
the string
with the regex
:
puts Benchmark.measure { string.scan(regex) } # => 17.830000 0.380000 18.210000 ( 18.235952)
As you can see scanning takes 17.83 long seconds, and that is too much for my use case.
How can I speed this up?
The match
and =~
methods won't do me any good, because I need the array of matches. I think that maybe I need to go outside Ruby to speed this up, but I have no idea how.
The real world regex is a lot of order IDs and the real world string is a lot of emails (subject and content) merged in to one string. The point of scanning the emails for mentions of the order IDs is to find out which order IDs customers have enquired about.
EDIT:
The reason why I would like to make an initial regex test is because I afterwards need to make a very big loop like this:
["order_id_1", "order_id_2"].each do |order_id|
["email body 1", "email body 2"].delete_if do |email_string|
match = (email_string =~ Regexp.new(order_id)) != nil
if match
# do stuff
end
match
end
end
end
And with many order IDs and many emails this will become a very big loop unless I initially asses which order IDs have been corresponded about.
Upvotes: 3
Views: 1430
Reputation: 110685
Suppose order_ids
is an array of order ids, which I assume contain numbers and/or digits and case is not an issue. If email
is a string containing one email,
email.scan(/\w+/) & order_ids
will give you all the order ids contained in that email. You can speed that up, however, by first converting order_ids
to a set:
require 'set'
order_id_set = order_ids.to_set
email.scan(/\w+/).select { |w| order_id_set.include?(w) }
In any event, there is no advantage to stringing the emails together.
I suggested you divide the email into words using:
email.scan(/\w+/)
but you may want something more elaborate. Suppose an email were as follows:
email = "I am really annoyed that my order AB123 was shipped late, " +
"that order CD456was poorly packed and I was overcharged for order EF789."
Let's look at two ways we might divide this into words:
esplit = email.split
#=> ["I", "am", "really", "annoyed", "that", "my", "order", "AB123",
# "was", "shipped", "late,", "that", "order", "CD456was", "poorly",
# "packed", "and", "I", "was", "overcharged", "for", "order", "EF789."]
escan = email.scan(/\w+/)
#=> ["I", "am", "really", "annoyed", "that", "my", "order", "AB123",
# "was", "shipped", "late", "that", "order", "CD456was", "poorly",
# "packed", "and", "I", "was", "overcharged", "for", "order", "EF789"]
You can see trouble is brewing (e.g., "CD456was"
) with both of these approaches. Now let's create a set of order id's:
require 'set'
order_id_set = %w{ AB123 CD456 EF789 GH012 }.to_set
#=> #<Set: {"AB123", "CD456", "EF789", "GH012"}>
and then search the email for order ids:
esplit.select { |w| order_id_set.include?(w) }
#=> ["AB123"]
escan.select { |w| order_id_set.include?(w) }
#=> ["AB123", "EF789"]
As you see, split
(same as split(' ')
and split(/\s+/)
), which splits the line on whitespace, caught the first order id, but missed the other two, CD456
because the writer failed to put a space between that id and the word "was"
and EF789
, because it was looking for "EF789."
.
scan(/\w+/)
, which splits out "words" (the \w
looks for a-z
, A-Z
, 0-9
and _
), did better, identifying "EF789"
as an order id, because the period is a "non-word" character. It also missed "CD456"
, however, as "w"
, "a"
and "s"
are word characters, so it included them. (This is similar to your "1234"
example.
What this means is the you need to craft a better regex. If, for example, all the order ids were like the ones in my example--two capital letters followed by three digits--you might write:
email.scan(/(?<![A-Z])[A-Z]{2}\d{3}(?!\d)/)
.select { |w| order_id_set.include?(w) }
#=> ["AB123", "CD456", "EF789"]
This matches two capital letters immediately preceded by anything other than a capital letter, followed by three digits, immediately followed by any character other than another digit. This was just an illustration of what is possible. You might want to post another question that is concerned exclusively with using regular expressions to identify possible order numbers in the emails. For that you would have to better describe what the order numbers look like, of course.
In any event, your regex would not contain #{order_id}
, which you mentioned. The only way of doing that would be to check each email with a separate regex for each order id, which would be extraordinarily inefficient. Instead, you want to use the regex to identify possible order numbers and then check them against the set of order numbers. In choosing a regex you might have a tradeoff between the number of order ids that are missed and the number of non-order id strings that you waste time checking against the set of order ids.
Upvotes: 3
Reputation: 15974
Here is an algorithm that runs in O(n^2)
time. This is much better than the performance of your regex, which is likely much worse than O(n^2)
.
The intuition behind this algorithm is that using strings and regexes is computationally expensive! Regexes are powerful tools when run over small strings or when run infrequently, but perform poorly when both the regex and string are complicated. However, in your particular case, you're not really interested in whether a string is formatted in a particular way. Instead, you want to find out which emails match which order IDs. We can do this directly, without the overhead of using regexes.
According to one of your comments, the data you receive isn't initially in the form of a long string and a long regex, but rather in two large lists. This is good, as using the lists makes it easier to use the algorithm. Since you're interested in knowing which order IDs match emails, we'll design our algorithm so that it produces an array containing every order ID that matched an email.
This is the algorthm:
If there are n
emails and m
order IDs, this algorithm takes O(nm)
time. If n
and m
are about the same, then the algorithm runs in O(n^2)
time.
This is some Ruby code implementing this algorithm:
emails = ['email1', 'email2', 'email3', #etc.
order_ids = ['order1', 'order2', 'order3', #etc.
# Create an array to hold the matched order IDs
matched_order_ids = []
# Perform a search for each email
emails.each |email| do
order_ids.delete_if |order_id| do
if email.include? order_id
matched_order_ids.push order_id
# tells delete_if to remove the order_id,
# saving us time searching in future emails
return true
else
# tells delete_if not to remove the order_id
return false
end
end
end
# At this point, matched_order_ids contains every
# order ID that was matched in at least one of the
# emails
Upvotes: 0
Reputation: 15171
Don't use a regular expression. Instead, use a specialized algorithm suited to the specific string and pattern you want to match. If you absolutely need to use a regular expression, you might try to find a different engine than the one that Ruby uses natively (although I doubt that this will be very fruitful). If you still need better performance, write the algorithm in a faster/lower-level language like C, and then call that from ruby with a native extension or something.
Here are some resources to help with native extensions:
As for crafting a specialized algorithm, I probably can't help you there.
Upvotes: 1