Noel Austin

Reputation: 53

Regex to extract street from full address but leaving out optional directional component

I'm trying to use .NET Regex to extract the street portion of a full address.

Given these addresses:
2565 W Field Stream Drive
2565 Field St
2565 2nd Street
2001 Easterman Road

I want these results:
Field Stream Drive
Field St
2nd Street
Easterman Road

I've come up with this "(?<=(^\d+\s[NSEW]{1}\s)).*(?=$)" but it doesn't return the street if the directional element is missing.

Upvotes: 1

Views: 97

Answers (2)

Cary Swoveland

Reputation: 110675

Wiktor has explained the problem with your regular expression. You could use the following expression to extract the street names from the four examples you gave, but there's no guarantee it will work with other addresses (such as "102 Broadway" or "221B Baker St").

(?i)(?<=\d +|[NEWS] +)(?:[^NEWS]|[NEWS](?! )).*


The regex engine performs the following operations.

(?i)         case-insensitive matching
(?<=         begin positive lookbehind
  \d +       match a digit then 1+ spaces
  |          or
  [NEWS] +   match 'N', 'E', 'W' or 'S', then 1+ spaces
)            end positive lookbehind
(?:          begin non-capture group
  [^NEWS]    match any character other than 'N', 'E', 'W', 'S'
  |          or
  [NEWS]     match 'N', 'E', 'W' or 'S'
  (?! )      not followed by a space (negative lookahead)
)            end non-capture group
.*           match 0+ characters to the end of the line
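
As a rough check of the walkthrough above, here is a minimal C# sketch that runs the pattern against the four sample addresses plus "221B Baker St". The helper name ExtractStreet and the "(no match)" placeholder are made up for illustration, and the code assumes C# 9 or later (top-level statements); on older compilers, wrap it in a Main method.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the matched street, or a placeholder when nothing matches.
static string ExtractStreet(string address)
{
    const string pattern = @"(?i)(?<=\d +|[NEWS] +)(?:[^NEWS]|[NEWS](?! )).*";
    Match m = Regex.Match(address, pattern);
    return m.Success ? m.Value : "(no match)";
}

string[] samples =
{
    "2565 W Field Stream Drive",
    "2565 Field St",
    "2565 2nd Street",
    "2001 Easterman Road",
    "221B Baker St",
};

foreach (string address in samples)
{
    Console.WriteLine($"{address} -> {ExtractStreet(address)}");
}
```

The four addresses from the question should give the expected streets. "221B Baker St" should give "(no match)", because the space before "Baker" is preceded by "B", which is neither a digit nor one of N, E, W, S.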

Upvotes: 0

Wiktor Stribiżew

Reputation: 626758

The problem is that the regex engine tests the lookbehind at each position in the string, moving from left to right, and the match starts at the first position where the lookbehind succeeds. Thus, you can't just make [WNSE]\s+ optional in the lookbehind, as in (?<=^\d+\s+(?:[WNSE]\s+)?).+: the lookbehind already succeeds right after the house number and the whitespace that follows it, so the match starts at the directional letter, and the position after it is never tried.
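
To see this behaviour, a small C# sketch (assuming C# 9 or later for top-level statements) can run the optional-lookbehind pattern on the first sample address:

```csharp
using System;
using System.Text.RegularExpressions;

// The optional directional part sits inside the lookbehind.
const string pattern = @"(?<=^\d+\s+(?:[WNSE]\s+)?).+";

// The lookbehind already succeeds right after "2565 ", so the match starts at "W".
Match m = Regex.Match("2565 W Field Stream Drive", pattern);
Console.WriteLine(m.Value); // W Field Stream Drive
```

The directional letter stays in the result because the engine never needs to try the later position where the optional part would have been consumed.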

A less efficient .NET solution, but one that returns the street as the whole match value, is

(?<=^\d+\s+[WNSE]\s+|^\d+(?!\s+[WNSE]\s)\s+).+

The first alternative in the lookbehind matches a location preceded by 1+ digits, 1+ whitespace characters, then W, N, S or E, and then 1+ whitespace characters. The second alternative matches a location preceded by 1+ digits and 1+ whitespace characters at the start of the string, provided the digits are not followed by whitespace, then W, N, S or E, and a whitespace character.
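
A minimal C# sketch of the lookbehind version, run against the four addresses from the question. The helper name StreetFromMatchValue is hypothetical, and the code assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: the whole match value is the street, so no group is needed.
static string StreetFromMatchValue(string address)
{
    const string pattern = @"(?<=^\d+\s+[WNSE]\s+|^\d+(?!\s+[WNSE]\s)\s+).+";
    return Regex.Match(address, pattern).Value;
}

string[] samples =
{
    "2565 W Field Stream Drive",
    "2565 Field St",
    "2565 2nd Street",
    "2001 Easterman Road",
};

foreach (string address in samples)
{
    Console.WriteLine(StreetFromMatchValue(address));
}
```

Match.Value is an empty string when nothing matches, so a caller that needs to tell "no match" apart from an empty street should check Match.Success instead.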


However, a much simpler solution is to use a capturing group:

^\d+\s+(?:[WNSE]\s+)?(.+)

Here, the optional directional part is always tried before the capturing group starts, so the .+ only matches what comes after the N, S, E or W, if present. The street is in Group 1, not in the whole match value.
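
A short C# sketch of the capturing-group approach, reading the street from Group 1. The helper name ExtractStreet and the "(no match)" placeholder are illustrative only, and the code assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the captured street, or a placeholder when nothing matches.
static string ExtractStreet(string address)
{
    Match m = Regex.Match(address, @"^\d+\s+(?:[WNSE]\s+)?(.+)");
    // Group 1 holds the street; the whole match still includes the house number.
    return m.Success ? m.Groups[1].Value : "(no match)";
}

string[] samples =
{
    "2565 W Field Stream Drive",
    "2565 Field St",
    "2565 2nd Street",
    "2001 Easterman Road",
};

foreach (string address in samples)
{
    Console.WriteLine(ExtractStreet(address));
}
```

In "2001 Easterman Road" the optional group first tries the "E" but finds no whitespace after it, so it backtracks and the capture starts at "Easterman".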

Upvotes: 1
