Reputation: 41
I'm using the following regex to match city:
[a-zA-Z]+(?:[ '-][a-zA-Z]+)*
The problem is that it matches not only the city but also part of the street name.
How can I make it match only the city (such as Brooklyn and Columbia City)?
UPDATE:
The data is represented in 1 line of text (each address will be fed to regex engine separately):
2778 Ray Ridge Pkwy,
Brooklyn NY 1194-5954
1776 99th St,
Brooklyn NY 11994-1264
1776 99th St,
Columbia City OR 11994-1264
Upvotes: 0
Views: 992
Reputation: 626738
I suggest the following approach: match the words from the beginning of the string up to the first occurrence of 2 uppercase letters followed by the ZIP (see the look-ahead (?=\s+[A-Z]{2}\s+\d+-\d+) below):
^[A-Za-z]+(?:[\s'-]+[A-Za-z]+)*(?=\s+[A-Z]{2}\s+\d+-\d+)
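A minimal Java sketch of this pattern, in case it helps to try it out. The class and method names are made up for illustration, and Pattern.MULTILINE is my own assumption (not part of the regex above): it lets ^ also match right after a line break, so an address passed as "street,\ncity state zip" still finds the city line.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
class CityBeforeZip {
    // Backslashes are doubled because this is a Java string literal.
    // MULTILINE is an assumption so that ^ can match at the start of the city line.
    private static final Pattern CITY = Pattern.compile(
            "^[A-Za-z]+(?:[\\s'-]+[A-Za-z]+)*(?=\\s+[A-Z]{2}\\s+\\d+-\\d+)",
            Pattern.MULTILINE);

    // Returns the city if the look-ahead finds "STATE digits-digits" after it.
    static Optional<String> extractCity(String address) {
        Matcher m = CITY.matcher(address);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extractCity("Brooklyn NY 11994-1264"));
        // prints Optional[Brooklyn]
        System.out.println(extractCity("1776 99th St,\nColumbia City OR 11994-1264"));
        // prints Optional[Columbia City]
    }
}
```

Note that if the street and the city share one physical line (no line break after the comma), the ^ anchor never reaches the city and this sketch returns Optional.empty().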
The regex:
^ - starts looking from the beginning of the string
[A-Za-z]+ - matches a word
(?:[\s'-]+[A-Za-z]+)* - matches 0 or more words that...
(?=\s+[A-Z]{2}\s+\d+-\d+) - ...are right before a space + 2 uppercase letters, space, 1 or more digits, hyphen and 1 or more digits.
If the ZIP (or whatever the numbers stand for) is optional, you may just rely on the 2 uppercase letters:
^[A-Za-z]+(?:[\s'-]+[A-Za-z]+)*(?=\s+[A-Z]{2}\b)
Note that \b in \s+[A-Z]{2}\b is a word boundary: it requires a non-word character (a space or punctuation) or the end of the string right after the 2 uppercase letters.
Just do not forget to double the backslashes in Java string literals (\\s for \s, \\b for \b).
Here is a Java code demo:
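import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class CityDemo { // class name is illustrative
    public static void main(String[] args) {
        String s = "Brooklyn NY 1194-5954";
        Pattern pattern = Pattern.compile("^[A-Za-z]+(?:[\\s'-]+[A-Za-z]+)*(?=\\s+[A-Z]{2}\\b)");
        Matcher matcher = pattern.matcher(s);
        while (matcher.find()) {
            System.out.println(matcher.group(0)); // prints "Brooklyn"
        }
    }
}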
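To see what the \b in the look-ahead actually changes, here is a small comparison sketch. The class name, the helper method and the "Los ANGELES" input are made up for illustration. Without \b, any two leading capitals of a longer uppercase word would pass for a state code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison class; names and sample inputs are illustrative only.
class WordBoundaryCheck {
    static final Pattern WITH_BOUNDARY = Pattern.compile(
            "^[A-Za-z]+(?:[\\s'-]+[A-Za-z]+)*(?=\\s+[A-Z]{2}\\b)");
    static final Pattern WITHOUT_BOUNDARY = Pattern.compile(
            "^[A-Za-z]+(?:[\\s'-]+[A-Za-z]+)*(?=\\s+[A-Z]{2})");

    // Returns the first match, or null when the pattern finds nothing.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // "AN" of "ANGELES" satisfies [A-Z]{2} when no boundary is required.
        System.out.println(firstMatch(WITHOUT_BOUNDARY, "Los ANGELES")); // prints Los
        // With \b the two capitals must end the word, so nothing matches here.
        System.out.println(firstMatch(WITH_BOUNDARY, "Los ANGELES"));    // prints null
        // End of string counts as a boundary, so a bare state code still works.
        System.out.println(firstMatch(WITH_BOUNDARY, "Brooklyn NY"));    // prints Brooklyn
    }
}
```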
Upvotes: 2
Reputation: 754
If all your data looks like the examples in the question, the city is everything between the comma after the street and the run of at least 2 uppercase letters that represents the state.
This pattern matches that span and captures the city in group 1 (the captured text may carry trailing whitespace):
,\s+([a-zA-Z\s]*)[A-Z]{2,}?\s+
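A rough Java sketch of how the capture group could be read. The class and method names are hypothetical, and the trim() call is my addition, because group 1 also picks up the whitespace that sits before the state.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class CityCaptureGroup {
    // Same pattern as above, with backslashes doubled for the Java string literal.
    private static final Pattern CITY =
            Pattern.compile(",\\s+([a-zA-Z\\s]*)[A-Z]{2,}?\\s+");

    // Returns group 1 without surrounding whitespace, or null if nothing matched.
    static String extractCity(String address) {
        Matcher m = CITY.matcher(address);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(extractCity("1776 99th St, Columbia City OR 11994-1264"));
        // prints Columbia City
        System.out.println(extractCity("2778 Ray Ridge Pkwy,\nBrooklyn NY 1194-5954"));
        // prints Brooklyn
    }
}
```

Since \s also matches a line break, this works whether the street and city are separated by a space or by a newline, as long as the comma is there.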
Upvotes: 0
Reputation: 41
OK.. I think I got it after hrs of tweaking and testing. May be helpful for someone else. This did the trick:
(?<=\n)[a-zA-Z]+(?:[ '-][a-z]+)* ?[A-Z]?[a-z]+
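For anyone who wants to try it, a quick Java sketch of this pattern (class and method names are made up, and the Salt Lake City address is a hypothetical extra input, not from my data). It assumes the city line comes right after a line break, since the look-behind (?<=\n) needs one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names and the third sample address are illustrative only.
class CityAfterLineBreak {
    // Backslashes are doubled because this is a Java string literal.
    private static final Pattern CITY =
            Pattern.compile("(?<=\\n)[a-zA-Z]+(?:[ '-][a-z]+)* ?[A-Z]?[a-z]+");

    // Returns the first match after a line break, or null if there is none.
    static String extractCity(String address) {
        Matcher m = CITY.matcher(address);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(extractCity("2778 Ray Ridge Pkwy,\nBrooklyn NY 1194-5954"));
        // prints Brooklyn
        System.out.println(extractCity("1776 99th St,\nColumbia City OR 11994-1264"));
        // prints Columbia City
        System.out.println(extractCity("1 Main St,\nSalt Lake City UT 84101"));
        // prints Salt Lake (only one capitalized follow-up word is allowed)
    }
}
```

So it handles the addresses above, but a city with three capitalized words gets cut short.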
Upvotes: 1