konewka

Reputation: 650

Finding city names in string

I have a list of strings (sentences) that might contain one or more Dutch city names. I also have a list of Dutch cities, and their various spellings. I am currently working in Python, but a solution in another language would also work.

What would be the best and most efficient way to retrieve a list of cities mentioned in the sentences?

What I do at the moment is loop through the sentence list and, within that loop, loop through the cities list, checking one by one whether place_name in sentence.lower(). So I have:

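from collections import Counter

places = Counter()  # missing keys start at 0, so no KeyError on first hit
for sentence in sentences:
    lowered = sentence.lower()  # lowercase once per sentence, not once per city
    for place_name in place_names:
        if place_name in lowered:
            places[place_name] += 1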

Is this the most efficient way to do this? I also run into the problem that cities like "Ee" exist in Holland, and that words with "ee" in them are quite common. For now I solved this by checking if place_name + ' ' in sentence.lower(), but this is suboptimal and ugly: it misses sentences like "Huis in Amsterdam", where the city name is not followed by a space, and it also won't work well with punctuation. I tried using regex, but that was way too slow. Is there a better way to solve this particular problem, or this kind of problem in general? I am leaning somewhat towards an NLP solution, but I also feel that would be massive overkill.

Upvotes: 1

Views: 6280

Answers (1)

alecxe

Reputation: 473853

You may want to look into Named Entity Recognition (NER) solutions in general. This can be done in NLTK as well, but here is a sample in spaCy. Cities are marked with the GPE label (GPE stands for "Geopolitical Entity", which covers countries, states, cities, etc.):

import spacy

nlp = spacy.load('en_core_web_lg')

doc = nlp('Some company is looking at buying an Amsterdam startup for $1 billion')

for ent in doc.ents:
    print(ent.text, ent.label_)

Prints:

Amsterdam GPE
$1 billion MONEY
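
If a full NER model feels like overkill and the list of city names is already known, a lighter option is to match against that list with word boundaries. The following is only a rough sketch, not part of the spaCy approach above: it assumes the names in place_names are lowercase, and the helper names build_city_pattern and count_cities are made up for illustration. All names are compiled into a single case-insensitive pattern, so each sentence is scanned once instead of once per city, and the lookarounds keep "Ee" from matching inside words such as "week" or "Een".

```python
import re
from collections import Counter

def build_city_pattern(place_names):
    # Longest names first, so a multi-word name is tried before a shorter one.
    ordered = sorted(set(place_names), key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    # (?<!\w) and (?!\w) reject matches that are glued to other word characters,
    # which handles punctuation and end-of-sentence without requiring a space.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)

def count_cities(sentences, place_names):
    pattern = build_city_pattern(place_names)
    counts = Counter()
    for sentence in sentences:
        # Count each city at most once per sentence, as the nested loop did.
        found = {m.group(0).lower() for m in pattern.finditer(sentence)}
        counts.update(found)
    return counts

# Hypothetical sample data for illustration only.
place_names = ["amsterdam", "ee", "den haag"]
sentences = ["Huis in Amsterdam", "Een week in Den Haag.", "Ee, Friesland"]
print(count_cities(sentences, place_names))
```

Whether this is fast enough depends on how many names and sentences there are. For very large name lists, tokenizing each sentence and looking up tokens in a set may scale better, at the cost of extra handling for multi-word names.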

Upvotes: 5
