rubyist

Reputation: 3132

match regular expression in ruby

I have strings as below

201-Grandview-Dr_Early_TX_76802 and /50-Washington-St

I am writing a regex to match both the strings.

((/^([0-9]+)-([^_]+)-([A-Za-z]{1,})$/ =~ data ) == 0)

But the above regex matches only 50-Washington-St, not the other strings.

So what could be wrong with this regex?

The updated list of the strings that should match:

201-Grandview-Dr_Early_TX_76802
/50-Washington-St
49220-Sunrose-Ln_Palm-Desert_CA_92260
201-Grandview-Dr_Early_TX_76802
50-Washington-St

Upvotes: 1

Views: 268

Answers (2)

Cary Swoveland

Reputation: 110645

I would like to suggest a way of approaching problems like this one. The main take-away is that complex regular expressions can be constructed in the same way as other Ruby code: create small code modules that can be easily tested and then combine those modules.

Consider the first string that must match the regex.

s = "201-Grandview-Dr_Early_TX_76802"

As this string contains no characters that need to be escaped, we can create a regex that will exactly match this string by merely replacing the double-quotes with forward slashes and adding start-of-string (\A) and end-of-string (\z) anchors:

r = /\A201-Grandview-Dr_Early_TX_76802\z/
  #=> /\A201-Grandview-Dr_Early_TX_76802\z/ 
s =~ r
  #=> 0 

This is what we have:

/\A201-Grandview-Dr_Early_TX_76802\z/
   ⬆︎street number
          ⬆︎street name
                  ⬆︎street name suffix
                      ⬆︎city
                           ⬆︎state
                                ⬆︎zip

Presumably the regex should match a string if and only if the string contains allowable values for each of these six fields and has the formatting shown between adjacent fields.

Let's begin by stipulating a separate regex for each of the six fields. Naturally, all of these regexes may need to be modified to suit requirements.

Street number

Typical street numbers might be "221", "221B", "221b". Let's say we might also have "A19" or "221BZ" but not "221-B". We might then write:

number = /[[:alnum:]]+/

(Search for "POSIX" in the documentation for Regexp.)
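A field regex like this one can be checked on its own before it is combined with the others. The following is only a sketch: whole_number is a name made up here, and the \A and \z anchors are added so that the whole candidate string has to be a street number.

```ruby
number = /[[:alnum:]]+/

# Anchor the field regex so it must account for the entire string.
# (whole_number is a name invented for this sketch.)
whole_number = /\A#{number}\z/

number_checks = %w[221 221B 221b A19 221BZ 221-B].map do |candidate|
  [candidate, whole_number.match?(candidate)]
end.to_h

number_checks["221B"]   #=> true
number_checks["A19"]    #=> true
number_checks["221-B"]  #=> false (the hyphen is not alphanumeric)
```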

Street name

I have assumed street names consist of a single word or multiple words separated by a single space, where each word is all lowercase except for the first letter, which is capitalized.

street = /[[:upper:]][[:lower:]]+(?:\s[[:upper:]][[:lower:]]+)*/

[[:upper:]][[:lower:]]+ matches the first word, and (?:\s[[:upper:]][[:lower:]]+)* matches a whitespace character followed by a capitalized word, repeated zero or more times. ((?:...) is a non-capture group; the trailing * means repeat zero or more times.)
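To see what these assumptions allow and reject, the street regex can be tried against a few candidates by itself. This is a sketch only; whole_street and the sample names are made up for the illustration.

```ruby
street = /[[:upper:]][[:lower:]]+(?:\s[[:upper:]][[:lower:]]+)*/

# Anchored copy so the entire candidate must be a street name.
whole_street = /\A#{street}\z/

street_checks = [
  "Grandview",    # single capitalized word
  "Grand View",   # two capitalized words, one space
  "GrandView",    # capital letter inside a word
  "grandview",    # first letter not capitalized
  "Grand  View"   # two spaces between the words
].map { |candidate| [candidate, whole_street.match?(candidate)] }.to_h

street_checks["Grandview"]    #=> true
street_checks["Grand View"]   #=> true
street_checks["GrandView"]    #=> false
street_checks["grandview"]    #=> false
street_checks["Grand  View"]  #=> false
```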

Street name suffix

I have assumed the street name suffix (e.g., 'Street', 'St.') is a single word, all lower case except the first character, which is upper case, optionally ending with a period:

suffix = /[[:upper:]][[:lower:]]+\.?/

City

I have assumed that names of cities have the same requirements as do names of streets:

city = street
  #=> /[[:upper:]][[:lower:]]+(?:\s[[:upper:]][[:lower:]]+)*/

State

States are given by two capital letters:

state = /[[:upper:]]{2}/

We could be more precise by writing:

state = Regexp.union %w| AL AK AZ ... |

but then we'd have to update it every time a territory becomes a new state or (possibly due to recent events) a state secedes from the union.
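For what it's worth, here is how Regexp.union behaves with a list of abbreviations. The list is deliberately left incomplete: sample_states holds only the three abbreviations shown above, so this is a sketch of the mechanism and not a usable state list.

```ruby
# Deliberately incomplete: only the three abbreviations shown above.
sample_states = %w[AL AK AZ]

state = Regexp.union(sample_states)
state.source  #=> "AL|AK|AZ"

# Interpolating a Regexp wraps it in (?-mix:...), so the alternation
# stays grouped when it is placed between the anchors.
whole_state = /\A#{state}\z/

whole_state.match?("AK")    #=> true
whole_state.match?("ZZ")    #=> false (two capitals, but not in the list)
whole_state.match?("ALAK")  #=> false
```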

Zip code

Zip codes are five digits, optionally followed by a hyphen and four more digits (nine digits in all).

zip = /\d{5}(?:-\d{4})?/
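As with the other fields, the zip regex can be exercised separately. A minimal sketch, with whole_zip being a name introduced only for this check:

```ruby
zip = /\d{5}(?:-\d{4})?/

# Anchored copy so the entire candidate must be a zip code.
whole_zip = /\A#{zip}\z/

zip_checks = %w[76802 76802-0000 7680 768021234 76802-000].map do |candidate|
  [candidate, whole_zip.match?(candidate)]
end.to_h

zip_checks["76802"]       #=> true
zip_checks["76802-0000"]  #=> true
zip_checks["7680"]        #=> false (only four digits)
zip_checks["768021234"]   #=> false (nine digits without the hyphen)
zip_checks["76802-000"]   #=> false (only three digits after the hyphen)
```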

Using

/\A201-Grandview-Dr_Early_TX_76802\z/

as our pattern, our overall regex is therefore the following:

r1 = /
     \A # match start of string 
     #{number}
     -
     #{street}
     -
     #{suffix}
     _
     #{city}
     _
     #{state}
     _
     #{zip}
     \z # match end of string
     /x # free-spacing regex definition mode
  #=> /
  #   \A # match start of string 
  #   (?-mix:[[:alnum:]]+)
  #   -
  #   (?-mix:[[:upper:]][[:lower:]]+(?:\s[[:upper:]][[:lower:]]+)*)
  #   -
  #   (?-mix:[[:upper:]][[:lower:]]+\.?)
  #   _
  #   (?-mix:[[:upper:]][[:lower:]]+(?:\s[[:upper:]][[:lower:]]+)*)
  #   _
  #   (?-mix:[[:upper:]]{2})
  #   _
  #   (?-mix:\d{5}(?:-\d{4})?)
  #   \z # match end of string
  #   /x

Let's try it for the first string and variations thereof:

"201-Grandview-Dr_Early_TX_76802" =~ r1
   #=> 0
"221B-Grand View-Dr._El Paso_TX_76802-0000" =~ r1
   #=> 0
"2A0B1-Grandview-Dr_Early_ZZ_76802" =~ r1
   #=> 0
"201-GrandView-Dr_Early_TX_76802" =~ r1
   #=> nil
"201-Grandview-Dr_Early_TX_7680" =~ r1
   #=> nil
"201-Pi11ar-St_Early_TX_76802" =~ r1
   #=> nil
"I live at 201-Grandview-Dr_Early_TX_76802" =~ r1
   #=> nil
"201-😎mg Circle-Lane_Early_TX_76802" =~ r1
   #=> nil

Now consider the second example string for which there should be a match:

"/50-Washington-St"

We see the regex for this is simply

r2 = /
     \A
     \/
     #{number}
     -
     #{street}
     -
     #{suffix}
     \z
     /x
 #=> /
 #   \A
 #   \/
 #   (?-mix:[[:alnum:]]+)
 #   -
 #   (?-mix:[[:upper:]][[:lower:]]+(?:\s[[:upper:]][[:lower:]]+)*)
 #   -
 #   (?-mix:[[:upper:]][[:lower:]]+\.?)
 #   \z
 #   /x 

Let's try it.

 "/50-Washington-St" =~ r2
   #=> 0
 "50-Washington-St" =~ r2
   #=> nil
 "/50-Washington-St_Early" =~ r2
   #=> nil

So now our overall regex is simply

r = Regexp.union(r1,r2)
  #=> /(?x-mi:
  #   \A # match start of string 
  #   (?-mix:[[:alnum:]]+)
  #   -
  #   (?-mix:[[:upper:]][[:lower:]]+(?:\s[[:upper:]][[:lower:]]+)*)
  #   -
  #   (?-mix:[[:upper:]][[:lower:]]+\.?)
  #   _
  #   (?-mix:[[:upper:]][[:lower:]]+(?:\s[[:upper:]][[:lower:]]+)*)
  #   _
  #   (?-mix:[[:upper:]]{2})
  #   _
  #   (?-mix:\d{5}(?:-\d{4})?)
  #   \z # match end of string
  #   )|(?x-mi:
  #   \A
  #   \/
  #   (?-mix:[[:alnum:]]+)
  #   -
  #   (?-mix:[[:upper:]][[:lower:]]+(?:\s[[:upper:]][[:lower:]]+)*)
  #   -
  #   (?-mix:[[:upper:]][[:lower:]]+\.?)
  #   \z
  #   )/ 

"201-Grandview-Dr_Early_TX_76802" =~ r
  #=> 0
"/50-Washington-St" =~ r
  #=> 0
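Note that the updated list in the question also contains "50-Washington-St" (no leading slash) and "49220-Sunrose-Ln_Palm-Desert_CA_92260" (a hyphen inside the city name), neither of which matches r as built above. Because the regex is assembled from small pieces, adjusting it is mostly a matter of changing one piece at a time. The following is only one possible sketch, under assumptions that are mine and not stated in the question: the leading slash is optional for both forms, the city, state and zip are either all present or all absent, and words in a city name may be joined by a hyphen as well as whitespace. The names word, city_state_zip and address are made up for the illustration.

```ruby
number = /[[:alnum:]]+/
word   = /[[:upper:]][[:lower:]]+/
street = /#{word}(?:\s#{word})*/
suffix = /#{word}\.?/
city   = /#{word}(?:[\s-]#{word})*/  # assumption: a hyphen may join words
state  = /[[:upper:]]{2}/
zip    = /\d{5}(?:-\d{4})?/

# Assumption: city, state and zip appear together or not at all.
city_state_zip = /_#{city}_#{state}_#{zip}/

# Assumption: the leading slash is optional in either form.
address = /\A\/?#{number}-#{street}-#{suffix}(?:#{city_state_zip})?\z/

should_match = [
  "201-Grandview-Dr_Early_TX_76802",
  "/50-Washington-St",
  "49220-Sunrose-Ln_Palm-Desert_CA_92260",
  "50-Washington-St"
]

should_match.all? { |s| address.match?(s) }  #=> true
address.match?("50-Washington-St_Early")     #=> false (state and zip missing)
address.match?("201-GrandView-Dr_Early_TX_76802")  #=> false
```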

Upvotes: 1

Wiktor Stribiżew

Reputation: 626689

You may fix the regex like

/^\/?([0-9]+)-(.+?)-(\w+)$/

or, to match the whole string (mind that in a Ruby regex ^ matches at the start of any line and $ at the end of any line):

/\A\/?([0-9]+)-(.+?)-(\w+)\z/

See the Rubular demo
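The difference between the two sets of anchors only shows up with multi-line input. A small sketch, where the sample strings and variable names are made up for the illustration:

```ruby
line_anchored   = /^\/?([0-9]+)-(.+?)-(\w+)$/
string_anchored = /\A\/?([0-9]+)-(.+?)-(\w+)\z/

leading_junk  = "junk\n50-Washington-St"
trailing_junk = "50-Washington-St\njunk"

# ^ and $ are satisfied by any single line inside the string.
leading_junk  =~ line_anchored    #=> 5
trailing_junk =~ line_anchored    #=> 0

# \A and \z require the match to span the whole string.
leading_junk  =~ string_anchored  #=> nil
trailing_junk =~ string_anchored  #=> nil
```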

Pattern details:

  • \A - string start
  • \/? - an optional /
  • ([0-9]+) - Group 1: one or more digits
  • - - a hyphen
  • (.+?) - Group 2: one or more chars other than linebreak chars, as few as possible
  • - - a hyphen
  • (\w+) - Group 3: one or more word ([A-Za-z0-9_]) characters
  • \z - end of string.
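To illustrate the breakdown above, here is a sketch that runs the pattern over the strings from the question and collects the three groups (the variable names are made up here). Note that Group 3 takes everything after the last hyphen it can reach, underscores included, so the groups do not line up with address fields when a city, state and zip follow.

```ruby
pattern = /\A\/?([0-9]+)-(.+?)-(\w+)\z/

samples = [
  "201-Grandview-Dr_Early_TX_76802",
  "/50-Washington-St",
  "49220-Sunrose-Ln_Palm-Desert_CA_92260",
  "50-Washington-St"
]

# captures is nil for a string that does not match.
captures = samples.map { |s| [s, pattern.match(s)&.captures] }.to_h

captures["201-Grandview-Dr_Early_TX_76802"]
  #=> ["201", "Grandview", "Dr_Early_TX_76802"]
captures["/50-Washington-St"]
  #=> ["50", "Washington", "St"]
captures["49220-Sunrose-Ln_Palm-Desert_CA_92260"]
  #=> ["49220", "Sunrose-Ln_Palm", "Desert_CA_92260"]
captures["50-Washington-St"]
  #=> ["50", "Washington", "St"]
```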

Upvotes: 5
