Reputation: 650
I have a list of strings (sentences) that might contain one or more Dutch city names. I also have a list of Dutch cities and their various spellings. I am currently working in Python, but a solution in another language would also work.
What would be the best and most efficient way to retrieve a list of cities mentioned in the sentences?
What I do at the moment is loop through the sentence list, and then within that loop, loop through the cities list and check one by one whether place_name in sentence.lower(), so I have:
for sentence in sentences:
    for place_name in place_names:
        if place_name in sentence.lower():
            places[place_name] = places[place_name] + 1
Is this the most efficient way to do this? I also run into the problem that cities like "Ee" exist in Holland, and that words with "ee" in them are quite common. For now I have solved this by checking for place_name + ' ' in sentence.lower() instead, but that is of course suboptimal and ugly: it disregards sentences like "Huis in Amsterdam", since the sentence doesn't end with a space, and it doesn't handle punctuation well either. I tried using regex, but that is of course way too slow. Would there be a better way to solve this particular problem, or to solve this problem in general? I am leaning somewhat towards an NLP solution, but I also feel like that would be massive overkill.
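For reference, the regex direction I could fall back on would be a single precompiled alternation with word boundaries, rather than a separate pattern or check per city; a rough sketch with made-up sample data (the real lists are much longer):

import re
from collections import Counter

# Made-up sample data standing in for the real sentences / place_names.
place_names = ["amsterdam", "ee", "den haag"]
sentences = ["Huis in Amsterdam", "Een mooi meer in Ee.", "Geen stad hier"]

# One precompiled alternation with \b word boundaries; longest names first so
# multi-word names are preferred over shorter names they contain.
pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, sorted(place_names, key=len, reverse=True))) + r")\b",
    re.IGNORECASE,
)

places = Counter()
for sentence in sentences:
    for match in pattern.findall(sentence):
        places[match.lower()] += 1

print(places)  # Counter({'amsterdam': 1, 'ee': 1})

This avoids the substring problem ("Een" and "meer" no longer count as matches for "Ee"), but names that start with punctuation, such as 's-Hertogenbosch, would still need extra care, since \b only fires between a word character and a non-word character.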
Upvotes: 1
Views: 6280
Reputation: 473853
You may look into Named Entity Recognition solutions in general. This can be done in nltk as well, but here is a sample in spaCy - cities would be marked with GPE labels (GPE stands for "Geopolitical Entity": countries, states, cities, etc.):
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp(u'Some company is looking at buying an Amsterdam startup for $1 billion')
for ent in doc.ents:
    print(ent.text, ent.label_)
Prints:
Amsterdam GPE
$1 billion MONEY
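Since your sentences are Dutch and you already have a whitelist of spellings, one way to tie the NER output back to your own place_names list could look roughly like this (the sample data is made up; spaCy also ships Dutch pipelines such as nl_core_news_sm, which I believe tag locations as LOC rather than GPE):

from collections import Counter
import spacy

# Made-up sample data standing in for your sentences / place_names.
sentences = ["Some company is looking at buying an Amsterdam startup for $1 billion"]
place_names = {"amsterdam", "rotterdam", "den haag"}

nlp = spacy.load('en_core_web_lg')

places = Counter()
for doc in nlp.pipe(sentences):  # nlp.pipe processes the sentences in batches
    for ent in doc.ents:
        if ent.label_ == "GPE" and ent.text.lower() in place_names:
            places[ent.text.lower()] += 1

print(places)  # Counter({'amsterdam': 1})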
Upvotes: 5