Ron Pall

Reputation: 41

Java: Match City Regex

I'm using the following regex to match a city:

[a-zA-Z]+(?:[ '-][a-zA-Z]+)*

The problem is that it matches not only the city but also parts of the street name.

How can I make it match only the city (such as Brooklyn and Columbia City)?

UPDATE:

The data is represented as 1 line of text per address (each address will be fed to the regex engine separately):

2778 Ray Ridge Pkwy,
Brooklyn NY 1194-5954

1776 99th St,
Brooklyn NY 11994-1264

1776 99th St,
Columbia City  OR 11994-1264

Upvotes: 0

Views: 992

Answers (3)

Wiktor Stribiżew

Reputation: 626738

I suggest the following approach: match the words from the beginning of the string up to the first occurrence of 2 uppercase letters followed by the ZIP (see the look-ahead (?=\s+[A-Z]{2}\s+\d+-\d+) below):

^[A-Za-z]+(?:[\s'-]+[A-Za-z]+)*(?=\s+[A-Z]{2}\s+\d+-\d+)

The regex:

  • ^ - anchors the match at the beginning of the string
  • [A-Za-z]+ - matches a word (1 or more ASCII letters)
  • (?:[\s'-]+[A-Za-z]+)* - matches 0 or more further words, each preceded by whitespace, apostrophes or hyphens, that...
  • (?=\s+[A-Z]{2}\s+\d+-\d+) - are right before whitespace + 2 uppercase letters, whitespace, 1 or more digits, a hyphen and 1 or more digits.
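As a rough sketch of how the ZIP look-ahead variant behaves on the full sample addresses from the question: the snippet below assumes the city starts on its own line after the street (as in the samples), so Pattern.MULTILINE is added to let ^ match right after the line break. That flag, the class name CityBeforeStateAndZip and the helper city() are assumptions made for illustration, not part of the original regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CityBeforeStateAndZip {
    // Same regex as above. MULTILINE is an assumption: it lets ^ match after "\n",
    // because the samples put the street on the first line and the city on the second.
    private static final Pattern CITY = Pattern.compile(
            "^[A-Za-z]+(?:[\\s'-]+[A-Za-z]+)*(?=\\s+[A-Z]{2}\\s+\\d+-\\d+)",
            Pattern.MULTILINE);

    // Hypothetical helper: returns the city, or null if the address does not fit the pattern.
    static String city(String address) {
        Matcher m = CITY.matcher(address);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(city("2778 Ray Ridge Pkwy, \nBrooklyn NY 1194-5954"));    // Brooklyn
        System.out.println(city("1776 99th St,\nColumbia City  OR 11994-1264"));     // Columbia City
    }
}
```

Without the flag, ^ only matches at the very start of the input, so the regex works as written only when the string begins with the city.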

If the ZIP (or whatever the numbers stand for) is optional, you may just rely on the 2 uppercase letters:

^[A-Za-z]+(?:[\s'-]+[A-Za-z]+)*(?=\s+[A-Z]{2}\b)

Note that \b in \s+[A-Z]{2}\b is a word boundary that requires a non-word character (space or punctuation) or the end of the string right after the 2 uppercase letters.

Just do not forget to double the backslashes in Java string literals so that regex escapes such as \s and \b reach the regex engine.

Here is a Java code demo:

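import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "Brooklyn NY 1194-5954";
// City words at the start of the string, right before a 2-letter state code
Pattern pattern = Pattern.compile("^[A-Za-z]+(?:[\\s'-]+[A-Za-z]+)*(?=\\s+[A-Z]{2}\\b)");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) { // ^ allows at most one match here, so no loop is needed
    System.out.println(matcher.group(0)); // prints "Brooklyn"
}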

Upvotes: 2

ceth

Reputation: 754

If all your data looks like the examples in the question, the city is everything between the comma after the street and the run of at least 2 uppercase letters that represents the state.

This regex matches that pattern and captures the city in group 1:

,\s+([a-zA-Z\s]*)[A-Z]{2,}?\s+
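A minimal Java sketch of reading that group, assuming the address is passed in as one string with the street, a comma, and then the city line. The class name CityAfterComma and the helper city() are made up for illustration. Since [a-zA-Z\s]* also consumes the whitespace in front of the state code, the captured text is trimmed here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CityAfterComma {
    // Regex from this answer; group 1 is the text between the comma and the state code.
    private static final Pattern CITY =
            Pattern.compile(",\\s+([a-zA-Z\\s]*)[A-Z]{2,}?\\s+");

    // Hypothetical helper: returns the trimmed city, or null if nothing matches.
    static String city(String address) {
        Matcher m = CITY.matcher(address);
        // group(1) still contains the whitespace before the state code, hence trim()
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(city("2778 Ray Ridge Pkwy, \nBrooklyn NY 1194-5954"));    // Brooklyn
        System.out.println(city("1776 99th St,\nColumbia City  OR 11994-1264"));     // Columbia City
    }
}
```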

Upvotes: 0

Ron Pall

Reputation: 41

OK, I think I got it after hours of tweaking and testing. It may be helpful for someone else. This did the trick:

(?<=\n)[a-zA-Z]+(?:[ '-][a-z]+)* ?[A-Z]?[a-z]+
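For anyone who wants to try it, here is a small Java sketch of that regex applied to the sample addresses. It assumes the city always follows a line break inside the string, since the look-behind (?<=\n) requires one. The class name CityAfterNewline and the helper city() are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CityAfterNewline {
    // Regex from this answer; (?<=\n) only lets a match start right after a line break.
    private static final Pattern CITY =
            Pattern.compile("(?<=\\n)[a-zA-Z]+(?:[ '-][a-z]+)* ?[A-Z]?[a-z]+");

    // Hypothetical helper: returns the city, or null if nothing matches.
    static String city(String address) {
        Matcher m = CITY.matcher(address);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(city("2778 Ray Ridge Pkwy, \nBrooklyn NY 1194-5954"));    // Brooklyn
        System.out.println(city("1776 99th St,\nColumbia City  OR 11994-1264"));     // Columbia City
    }
}
```

Two limits worth knowing: a string without a line break before the city yields no match, and the regex allows only one capitalized word after the first, so a three-word name such as "Salt Lake City" comes back as "Salt Lake".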

Upvotes: 1
