Named Entity Recognition (regex for places)

Question

How would you write a regex that detects places in the following format:

Word+, Word+, Word+

In a nutshell, I want the regex to match a city name, followed by a comma, followed by a state name, followed by a comma, followed by the country name. The city, state and country names can each be a single word or several words separated by spaces.

Here's my failed attempt at it:

r'([A-Z][a-z]+ ?)+?, ([A-Z][a-z]+ ?)+?, ([A-Z][a-z]+ ?)+?'

It can detect places like:

Hyderabad, Andhra Pradesh, India

But not places like:

Bangalore, Karnataka, India
New York City, New York, United States

Casimir et Hippolyte · Accepted Answer

If you need to obtain the city, state and country in separate capturing groups, you can use:

r'(?i)([a-z]+(?: [a-z]+)*), ([a-z]+(?: [a-z]+)*), ([a-z]+(?: [a-z]+)*)'
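For illustration, here is a minimal sketch of how that pattern could be used from Python. The names PLACE and split_place are made up for this example, and it assumes you only care about the first match in the string:

```python
import re

# Pattern from the answer: three comma-separated parts, each made of one or
# more space-separated words; (?i) lets [a-z] match upper case as well.
PLACE = re.compile(
    r'(?i)([a-z]+(?: [a-z]+)*), ([a-z]+(?: [a-z]+)*), ([a-z]+(?: [a-z]+)*)'
)

def split_place(text):
    """Return (city, state, country) for the first match in text, or None."""
    m = PLACE.search(text)
    return m.groups() if m else None

print(split_place("Bangalore, Karnataka, India"))
print(split_place("New York City, New York, United States"))
```

Both strings from the question that the original attempt was reported to miss come back split into three parts.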

Otherwise, this one matches the whole substring without capturing groups:

r'(?i)[a-z]+(?: [a-z]+)*(?:, [a-z]+(?: [a-z]+)*){2}'
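Since this version has no capturing groups, re.findall returns the whole matched substrings, which may be handy for pulling several places out of a longer text. A small sketch (PLACE_SPAN and the sample text are assumptions for the example):

```python
import re

# Same shape as above, but non-capturing: a first part, then exactly two
# more ", part" repetitions.
PLACE_SPAN = re.compile(r'(?i)[a-z]+(?: [a-z]+)*(?:, [a-z]+(?: [a-z]+)*){2}')

text = "Offices: Hyderabad, Andhra Pradesh, India; Bangalore, Karnataka, India."
places = PLACE_SPAN.findall(text)
print(places)

# Caveat: any words directly before the city are part of the match too.
swallowed = PLACE_SPAN.findall("I live in Hyderabad, Andhra Pradesh, India")
print(swallowed)
```

Note the second call: because every word is accepted regardless of case, the first part also takes in "I live in", so this pattern works best when the place is delimited by punctuation or stands on its own.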

If you need a capital letter at the start of each word, you can adapt the two patterns by replacing [a-z]+ with [A-Z][a-z]* and removing the (?i) modifier. Keep in mind, though, that not all city names have a capital letter at the beginning of each word, and that words can be separated by a hyphen, for example: Boulogne-sur-Mer, Rouperroux-le-Coquet or Jouy-en-Josas.
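As a rough sketch of that adaptation applied to the capturing pattern (WORDS, CAP_PLACE and split_cap_place are names invented for the example):

```python
import re

# One part: capitalised words separated by single spaces, no (?i) modifier.
WORDS = r'[A-Z][a-z]*(?: [A-Z][a-z]*)*'
CAP_PLACE = re.compile(rf'({WORDS}), ({WORDS}), ({WORDS})')

def split_cap_place(text):
    """Return (city, state, country) for the first match in text, or None."""
    m = CAP_PLACE.search(text)
    return m.groups() if m else None

# Lower-case words before the city are no longer taken into the match.
print(split_cap_place("I live in Hyderabad, Andhra Pradesh, India"))

# Hyphenated names with lower-case words are not matched by this version.
print(split_cap_place("Boulogne-sur-Mer, Pas-de-Calais, France"))
```

The first call returns the three parts without the leading "I live in"; the second returns None, which is the limitation with hyphenated names mentioned above.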

A more realistic pattern could be:

r"([A-Z][a-z]*(?:[ '-][A-Za-z]+)*), ([A-Z][a-z]*(?:[ '-][A-Za-z]+)*), ([A-Z][a-z]*(?:[ '-][A-Za-z]+)*)"
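Here is a short sketch of this pattern in use. The pattern contains an apostrophe inside the character class, so the Python string is written with double quotes; PART, REAL_PLACE and split_real_place are hypothetical names for the example:

```python
import re

# One part: a capitalised first word, then any number of further words
# introduced by a space, an apostrophe or a hyphen (upper or lower case).
PART = r"[A-Z][a-z]*(?:[ '-][A-Za-z]+)*"
REAL_PLACE = re.compile(rf"({PART}), ({PART}), ({PART})")

def split_real_place(text):
    """Return (city, state, country) for the first match in text, or None."""
    m = REAL_PLACE.search(text)
    return m.groups() if m else None

print(split_real_place("Boulogne-sur-Mer, Pas-de-Calais, France"))
print(split_real_place("Coeur d'Alene, Idaho, United States"))
```

Only the first word of each part has to be capitalised here, which is what lets "sur", "de" and "d" through.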

This can still be improved (for example, it doesn't handle accented letters).
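One possible way to handle accented letters in Python 3, offered as a sketch and not as part of the patterns above: swap the ASCII ranges for [^\W\d_], which for str patterns matches roughly "any Unicode letter". This version drops the capital-letter requirement, and the names LETTERS, UNICODE_PLACE and split_place_unicode are assumptions for the example:

```python
import re
import unicodedata

# [^\W\d_] = a word character that is neither a decimal digit nor "_".
# For str patterns in Python 3 this is approximately "a letter", including
# accented ones (a few non-decimal numeric characters also slip through).
LETTERS = r"[^\W\d_]+"
PART = rf"{LETTERS}(?:[ '-]{LETTERS})*"
UNICODE_PLACE = re.compile(rf"({PART}), ({PART}), ({PART})")

def split_place_unicode(text):
    """Return (city, state, country) for the first match in text, or None."""
    # Normalise to NFC so a base letter plus a combining accent becomes one
    # precomposed letter; combining marks alone are not word characters.
    text = unicodedata.normalize("NFC", text)
    m = UNICODE_PLACE.search(text)
    return m.groups() if m else None

print(split_place_unicode("Orléans, Centre-Val de Loire, France"))
print(split_place_unicode("São Paulo, São Paulo, Brazil"))
```

Without the capital-letter check, words in front of the city are matched as part of it, so this is best applied to strings that contain only the place.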
