stratis

Reputation: 8042

Conditional regex in Ruby

I've got the following string:

'USD 100'

Based on this post I'm trying to capture 100 if USD is contained in the string or the individual (currency) characters if USD is not contained in the string.

For example:

'USD 100' # => '100'
'YEN 300' # => ['Y', 'E', 'N']

So far I've got up to this but it's not working:

https://rubular.com/r/cK8Hn2mzrheHXZ

Interestingly if I place the USD after the amount it seems to work. Ideally I'd like to have the same behaviour regardless of the position of the currency characters.

Upvotes: 5

Views: 831

Answers (5)

The fourth bird

Reputation: 163267

Anatomy of your pattern

(?=.*(USD))(?(1)\d+|[a-zA-Z])
|    |     | |  |   |_______
|    |     | |  |   Else match a single char a-zA-Z
|    |     | |  |   
|    |     | |  |__
|    |     | |  If group 1 exists, match 1+ digits
|    |     | |
|    |     | |__
|    |     | Test for group 1
|    |     |_________________
|    |     If Clause
|    |___
|    Capture group 1
|__________
Positive lookahead

About the pattern you tried

The positive lookahead is not anchored, so it is tried at each position. If it succeeds, the engine continues with the rest of the pattern; otherwise the attempt at that position fails and the engine moves on to the next position.

Why does the pattern not match?

At the first position the lookahead is true, as it can find USD to the right. The engine then tries to match 1+ digits, but the first char is U, which \d+ cannot match.

USD 100
⎸
First position

From the second position until the end, the lookahead is false because it cannot find USD to the right.

USD 100
 ⎸
Second position   
  • In the end, the if clause is tried only once, at the first position, where it could not match 1+ digits. The else clause is never tried, and overall there is no match.

  • For the YEN 300 string, the conditional is never reached, as the lookahead never finds USD to the right, and overall there is no match.
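
The behaviour described above can be checked directly in Ruby. This is only a minimal sketch: the constant and variable names are illustrative, and the third string is the reversed order that the question says seems to work.

```ruby
# The pattern from the question, unchanged.
PATTERN = /(?=.*(USD))(?(1)\d+|[a-zA-Z])/

# USD first: the lookahead succeeds only at position 0, where \d+ cannot match "U".
usd_first = 'USD 100'.match?(PATTERN)

# No USD at all: the lookahead fails at every position.
no_usd = 'YEN 300'.match?(PATTERN)

# Amount first: at position 0 the lookahead still sees USD, and \d+ matches "100".
amount_first = '100 USD'[PATTERN]

p usd_first     # => false
p no_usd        # => false
p amount_first  # => "100"
```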

Useful resources on conditionals can be found at, for example, rexegg.com and regular-expressions.info


If you want the separate matches, you might use:

\bUSD \K\d+|[A-Z](?=[A-Z]* \d+\b)

Explanation

  • \bUSD Assert a word boundary, then match USD and a space
  • \K\d+ Forget what is matched so far using \K and match 1+ digits
  • | Or
  • [A-Z] Match a char A-Z
  • (?=[A-Z]* \d+\b) Assert that what is on the right is optional chars A-Z, a space, 1+ digits, and a word boundary
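
A short sketch of how this pattern could be used from Ruby with String#scan; the constant name SEPARATE is only illustrative.

```ruby
# Pattern from this answer; \K drops "USD " from the reported match.
SEPARATE = /\bUSD \K\d+|[A-Z](?=[A-Z]* \d+\b)/

usd = 'USD 100'.scan(SEPARATE)
yen = 'YEN 300'.scan(SEPARATE)

p usd  # => ["100"]
p yen  # => ["Y", "E", "N"]
```

Because this version has no capture groups, scan returns the matched text itself. With a variant that uses capture groups, scan returns one array per match, so something like .flatten.compact would probably be needed to get the same flat result.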

regex demo

Or using capturing groups:

\bUSD \K(\d+)|([A-Z])(?=[A-Z]* \d+\b)

Regex demo

Upvotes: 3

stratis

Reputation: 8042

TL;DR

An excellent working solution can be found in Wiktor's answer and the rest of the posts.

Long answer:

Since I wasn't perfectly satisfied with Wiktor's explanation of why my solution wasn't working, I decided to dig into it a bit more myself and this is my take on it:

Given the string USD 100, the following regex

(?=.*(USD))(?(1)\d+|[a-zA-Z])

simply won't work. The crux of this whole thing is to figure out why. It turns out that using a lookahead (?=.*(USD)) with a capture group implicitly suggests that the position of USD (if any is found) is followed by some pattern, defined inside the conditional (?(1)\d+|[a-zA-Z]), which in this case yields nothing since there's nothing before USD.

If we break it down into steps, here's an outline of what (I think) is happening:

  1. The pointer is set at the very beginning. The lookahead (?=.*(USD)) is parsed and executed.
  2. USD is found, but since the expression is a lookahead, the pointer remains at the beginning of the string and nothing is consumed.
  3. The conditional (?(1)\d+|[a-zA-Z]) is parsed and executed.
  4. Group 1 is set (since USD has been found) however \d+ fails since the pointer searches from the beginning of the string to the beginning of the string which turns out is the furthest point we can search when using a lookahead! After all that's exactly why it's called a lookahead: The searching has to happen across a range which stops just before this one starts.
  5. Since no digits nor anything is found before USD, the regex returns no results. And as Wiktor correctly pointed out:

the second alternative pattern will never be tried, because you required USD to be present in the string for a match to occur.

which basically says that since USD is always present in the string, the system would never jump to the "else" statement even if something was eventually found before USD.

As a counterexample, if the same regex is tested on a string where the amount comes before USD, it will work:

'100 USD'

Hope this helps someone in the future.

Upvotes: -1

Cary Swoveland

Reputation: 110675

I suggest the information desired be extracted as follows.

R = /\b([A-Z]{3}) +(\d+)\b/

def doit(str)
  str.scan(R).each_with_object({}) do |(cc,val),h|
    h[cc] = (cc == 'USD') ? val : cc.split('')
  end
end

doit 'USD 100'
  #=> {"USD"=>"100"} 
doit 'YEN 300'
  #=> {"YEN"=>["Y", "E", "N"]} 
doit 'I had USD 6000 to spend'
  #=> {"USD"=>"6000"} 
doit 'I had YEN 25779 to spend'
  #=> {"YEN"=>["Y", "E", "N"]} 
doit 'I had USD 60 and CDN 80 to spend'
  #=> {"USD"=>"60", "CDN"=>["C", "D", "N"]} 
doit 'USD -100'
  #=> {} 
doit 'YENS 4000'
  #=> {} 

Regex demo

Ruby's regex engine performs the following operations.

\b          : assert a word boundary
([A-Z]{3})  : match 3 uppercase letters in capture group 1
\ +         : match 1+ spaces
(\d+)       : match 1+ digits in capture group 2
\b          : assert a word boundary

Upvotes: 0

Wiktor Stribiżew

Reputation: 626747

Your regex (?=.*(USD))(?(1)\d+|[a-zA-Z]) does not work because

  • (?=.*(USD)) - a positive lookahead, tried at every position in the string (if scan is used), that requires a USD substring after any 0 or more chars other than line break chars, as many as possible (that is, there can only be a match if USD occurs somewhere to the right of the current position on the line)
  • (?(1)\d+|[a-zA-Z]) - a conditional construct that matches 1+ digits if Group 1 matched (if there is USD), or, an ASCII letter will be tried. However, the second alternative pattern will never be tried, because you required USD to be present in the string for a match to occur.

Look at the USD 100 regex debugger, it shows exactly what happens when the (?=.*(USD))(?(1)\d+|[a-zA-Z]) regex tries to find a match:

  • Step 1 to 22: The lookahead pattern is tried first. The point here is that the match will fail immediately if the positive lookahead pattern does not find a match. In this case, USD is found at the start of the string (since the first time the pattern is tried, the regex index is at the string start position). The lookahead found a match.
  • Step 23-25: since a lookahead is a non-consuming pattern, the regex index is still at the string start position. The lookahead says "go ahead", and the conditional construct is entered. The (?(1) condition is met: Group 1, USD, was matched. So the first ("then") part is triggered. \d+ does not find any digits, since the letter U is at the start. The regex match fails at the string start position, but there are more positions in the string to test, since there is no \A or ^ anchor that would allow a match only at the start of the string/line.
  • Step 26: The regex engine index is advanced one char to the right, now, it is right before the letter S.
  • Step 27-40: The regex engine wants to find 0+ chars and then USD immediately to the right of the current location, but fails (U is already "behind" the index).
  • Then, the execution is just the same as described above: the regex fails to match USD anywhere to the right of the current location and eventually fails.
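
The position-by-position behaviour in the steps above can be approximated in Ruby. This is a rough sketch, not a reproduction of the debugger output: trace and hits are hypothetical helper names, and \G together with the position argument of String#match? is used to pin each check to a single start position.

```ruby
# Hypothetical helper: for each start position, report whether the lookahead
# and the "then" branch (\d+) would each succeed at exactly that position.
def trace(str)
  (0...str.length).map do |i|
    {
      pos: i,
      lookahead: str.match?(/\G(?=.*USD)/, i), # is USD somewhere to the right?
      digits: str.match?(/\G\d+/, i)           # can \d+ match right here?
    }
  end
end

# Positions where both parts succeed, i.e. where the whole pattern can match.
def hits(str)
  trace(str).select { |row| row[:lookahead] && row[:digits] }.map { |row| row[:pos] }
end

trace('USD 100').each { |row| p row }

p hits('USD 100')  # => []
p hits('100 USD')  # => [0, 1, 2]
```

For 'USD 100' the lookahead succeeds only at position 0, where \d+ fails, so no position satisfies both parts.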

If the USD is somewhere to the right of 100, then you'd get a match.

So, the lookahead does not set any search range, it simply allows matching the rest of the patterns (if its pattern matches) or not (if its pattern is not found).

You may use

.scan(/^USD.*?\K(\d+)|([a-zA-Z])/).flatten.compact

Pattern details

  • ^USD.*?\K(\d+) - either USD at the start of the string, then any 0 or more chars other than line break chars as few as possible, and then the text matched is dropped and 1+ digits are captured into Group 1
  • | - or
  • ([a-zA-Z]) - any ASCII letter captured into Group 2.

See Ruby demo:

p "USD 100".scan(/^USD.*?\K(\d+)|([a-zA-Z])/).flatten.compact
# => ["100"]
p "YEN 100".scan(/^USD.*?\K(\d+)|([a-zA-Z])/).flatten.compact
# => ["Y", "E", "N"]

Upvotes: 3

Tim Biegeleisen

Reputation: 521053

The following pattern seems to work:

\b(?:USD (\d+)|(?!USD\b)(\w+) \d+)\b

This works, with the caveat that it uses a single capture group for the non-USD currency symbol, so the symbol is captured whole rather than as individual characters. One part of the regex might merit explanation:

(?!USD\b)(\w+)

This uses a negative lookahead to assert that the currency symbol is not USD. If so, then it captures that currency symbol.
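
As a rough illustration of how this pattern might be used from Ruby, the sketch below splits the captured symbol into characters afterwards. The constant CURRENCY and the helper extract are hypothetical names, not part of the answer.

```ruby
# Pattern from this answer: group 1 is the USD amount, group 2 a non-USD symbol.
CURRENCY = /\b(?:USD (\d+)|(?!USD\b)(\w+) \d+)\b/

# Hypothetical helper: return the amount for USD, the symbol's characters
# for any other currency, or nil when nothing matches.
def extract(str)
  m = CURRENCY.match(str)
  return nil unless m

  m[1] || m[2].chars
end

p extract('USD 100')         # => "100"
p extract('YEN 300')         # => ["Y", "E", "N"]
p extract('no amount here')  # => nil
```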

Upvotes: 1
