Anshu Dwibhashi

Reputation: 4675

Named Entity Recognition (regex for places)

How would you write a regex that detects places in the following format:

Word+, Word+, Word+

In a nutshell, I want the regex to match a city name followed by a comma, then a state name followed by a comma, then a country name followed by a comma. The city, state, and country names can each be a single word or several words separated by spaces.


Here's my failed attempt at it:

r'([A-Z][a-z]+ ?)+?, ([A-Z][a-z]+ ?)+?, ([A-Z][a-z]+ ?)+?'

It can detect places like:

But not places like:

Upvotes: 1

Views: 996

Answers (1)

Casimir et Hippolyte

Reputation: 89584

If you need to obtain the city, state, and country in separate capturing groups, you can use:

r'(?i)([a-z]+(?: [a-z]+)*), ([a-z]+(?: [a-z]+)*), ([a-z]+(?: [a-z]+)*)'
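As a rough illustration of how the three groups come back, here is a minimal sketch using Python's `re` module. The names `PLACE_RE` and `extract_place` are made up for this example and are not part of the answer's pattern.

```python
import re

# The three-group pattern from above. (?i) makes [a-z] case-insensitive,
# so each group is one or more space-separated alphabetic words.
PLACE_RE = re.compile(
    r'(?i)([a-z]+(?: [a-z]+)*), ([a-z]+(?: [a-z]+)*), ([a-z]+(?: [a-z]+)*)'
)

def extract_place(text):
    """Return (city, state, country) for the first match, or None.

    Hypothetical helper, for illustration only.
    """
    match = PLACE_RE.search(text)
    return match.groups() if match else None

print(extract_place("Offices: San Francisco, California, United States"))
# ('San Francisco', 'California', 'United States')
```

Note that because the pattern is case-insensitive, any ordinary words directly before the city are swallowed into the first group: for "I live in San Francisco, California, United States" the city group comes back as "I live in San Francisco".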

Otherwise, this one matches the whole substring without capturing the parts:

r'(?i)[a-z]+(?: [a-z]+)*(?:, [a-z]+(?: [a-z]+)*){2}'
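If you only need the matched substrings, something along these lines should work. This is a sketch; `PLACE_SPAN_RE` and `find_place_spans` are hypothetical names chosen for the example.

```python
import re

# Same shape as the capturing version, but with non-capturing groups only:
# one word sequence, then exactly two more ", word sequence" parts.
PLACE_SPAN_RE = re.compile(r'(?i)[a-z]+(?: [a-z]+)*(?:, [a-z]+(?: [a-z]+)*){2}')

def find_place_spans(text):
    """Return every non-overlapping substring that matches the pattern."""
    return [m.group(0) for m in PLACE_SPAN_RE.finditer(text)]

print(find_place_spans("Paris, Ile de France, France; Austin, Texas, United States"))
# ['Paris, Ile de France, France', 'Austin, Texas, United States']
```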

If you need a capital letter at the start of each word, you can adapt the two patterns by replacing [a-z]+ with [A-Z][a-z]* and removing the (?i) modifier. Keep in mind, though, that not all city names have a capital letter at the beginning of each word, and that words can be separated by a dash, for example: Boulogne-sur-Mer, Rouperroux-le-Coquet or Jouy-en-Josas.
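Here is a sketch of the capitalized variant applied to the capturing pattern, assuming Python's `re` module. The names `CAP_PLACE_RE` and `extract_capitalized` are invented for the example. It also shows the dash limitation mentioned above.

```python
import re

# Capturing pattern with [a-z]+ replaced by [A-Z][a-z]* and (?i) removed,
# so every word must start with an uppercase ASCII letter.
CAP_PLACE_RE = re.compile(
    r'([A-Z][a-z]*(?: [A-Z][a-z]*)*), '
    r'([A-Z][a-z]*(?: [A-Z][a-z]*)*), '
    r'([A-Z][a-z]*(?: [A-Z][a-z]*)*)'
)

def extract_capitalized(text):
    """Return (city, state, country) for the first match, or None."""
    match = CAP_PLACE_RE.search(text)
    return match.groups() if match else None

# Lowercase words before the city are no longer swallowed.
print(extract_capitalized("I live in San Francisco, California, United States"))
# ('San Francisco', 'California', 'United States')

# Dashed names are not handled by this variant.
print(extract_capitalized("Boulogne-sur-Mer, Pas-de-Calais, France"))
# None
```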

A more realistic pattern can be:

r"([A-Z][a-z]*(?:[ '-][A-Za-z]+)*), ([A-Z][a-z]*(?:[ '-][A-Za-z]+)*), ([A-Z][a-z]*(?:[ '-][A-Za-z]+)*)"

That can still be improved (for example, it doesn't handle accented letters).
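A sketch of the more realistic pattern in Python follows. The pattern contains an apostrophe, so the raw string is written with double quotes here. The second pattern is not from the answer: it is one possible way to accept accented letters, using the `[^\W\d_]` idiom (any Unicode letter in Python 3 string patterns), and it gives up the capital-letter requirement. The names `REALISTIC_RE`, `UNI_WORDS` and `UNICODE_PLACE_RE` are hypothetical.

```python
import re

# Pattern from the answer: a capitalized first word, then further words
# joined by a space, an apostrophe, or a dash.
REALISTIC_RE = re.compile(
    r"([A-Z][a-z]*(?:[ '-][A-Za-z]+)*), "
    r"([A-Z][a-z]*(?:[ '-][A-Za-z]+)*), "
    r"([A-Z][a-z]*(?:[ '-][A-Za-z]+)*)"
)

# Assumption, not part of the answer: [^\W\d_] matches a word character
# that is neither a digit nor an underscore, i.e. a letter, accents included.
# This variant no longer requires a leading capital letter.
UNI_WORDS = r"[^\W\d_]+(?:[ '-][^\W\d_]+)*"
UNICODE_PLACE_RE = re.compile(rf"({UNI_WORDS}), ({UNI_WORDS}), ({UNI_WORDS})")

print(REALISTIC_RE.search("Boulogne-sur-Mer, Pas-de-Calais, France").groups())
# ('Boulogne-sur-Mer', 'Pas-de-Calais', 'France')
```

With these definitions, `REALISTIC_RE` finds no match in "Orléans, Loiret, France" because of the accented letter, while `UNICODE_PLACE_RE` captures the three parts. Like the case-insensitive patterns, the Unicode variant will also absorb ordinary lowercase words that directly precede the city.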

Upvotes: 2
