Reputation: 773
I have some text that may or may not contain a country name. For example:
' Nigeria: Hotspot Network LTD Rural Telephony Feasibility Study'
This is how I extract the country name from it. My first attempt:
import pycountry

def findCountry(stringText):
    # return the first country whose name occurs anywhere in the text
    for country in pycountry.countries:
        if country.name.lower() in stringText.lower():
            return country.name
    return None

findCountry("Nigeria: Hotspot Network LTD Rural Telephony Feasibility Study")
Unfortunately, it gives me the wrong output, Niger, whereas the correct one is Nigeria. Note that Niger and Nigeria are two different countries.
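To see why the first attempt returns Niger, the check can be reproduced without pycountry. This is only a sketch: SAMPLE_NAMES is a hypothetical hand-written stand-in for the names pycountry yields, ordered so that Niger is visited before Nigeria (as the output above suggests), and first_substring_match is a made-up helper name.

```python
# Hypothetical stand-in for the names pycountry would yield, in an order
# where "Niger" is visited before "Nigeria" (as the question's output suggests).
SAMPLE_NAMES = ["Niger", "Nigeria", "Norway"]

def first_substring_match(text, names=SAMPLE_NAMES):
    """Return the first name that occurs anywhere in text, ignoring case."""
    lowered = text.lower()
    for name in names:
        # `in` is a plain substring test: "niger" is a prefix of "nigeria"
        if name.lower() in lowered:
            return name
    return None

title = "Nigeria: Hotspot Network LTD Rural Telephony Feasibility Study"
result = first_substring_match(title)
print(result)  # Niger
```

The loop stops at the first hit, so the shorter name wins whenever it is a prefix of the longer one and happens to be checked first.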
My second attempt:
def findCountry(stringText):
    full_list = []
    for country in pycountry.countries:
        if country.name.lower() in stringText.lower():
            # collect the name, so the result matches the output shown below
            full_list.append(country.name)
    if len(full_list) > 0:
        return full_list
    return None
I get ['Niger', 'Nigeria'] as output, but I can't find a way to get Nigeria as my final output. How can I achieve this?
Note: here I know Nigeria is the correct answer, but later on the code itself will have to choose the final country name if one is present in the text, and the detection needs to be very accurate.
Upvotes: 5
Views: 2981
Reputation: 4723
The problem here is that in tests for occurrence as a substring, so Niger is found inside Nigeria. You could also swap the operands before and after in, but that would solve it for Nigeria and not for other cases. You can use == instead, which matches only when the whole string is exactly the country name.
def findCountry(stringText):
    for country in pycountry.countries:
        if country.name.lower() == stringText.lower():
            return country.name
    return None
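A quick check of how == behaves here, using a hypothetical SAMPLE_NAMES list in place of pycountry.countries and a made-up helper name exact_match. Equality only succeeds when the whole string is the country name, so for a longer title the candidate has to be isolated first. Splitting on the colon below is an assumption that happens to fit the example title, not a general rule.

```python
SAMPLE_NAMES = ["Niger", "Nigeria", "Norway"]  # hypothetical stand-in list

def exact_match(candidate, names=SAMPLE_NAMES):
    """Return the name equal to candidate (case-insensitive), else None."""
    wanted = candidate.strip().lower()
    for name in names:
        if name.lower() == wanted:
            return name
    return None

title = " Nigeria: Hotspot Network LTD Rural Telephony Feasibility Study"

whole = exact_match(title)                    # full title: no name equals it
prefix = exact_match(title.split(":", 1)[0])  # text before the first colon
print(whole, prefix)  # None Nigeria
```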
Upvotes: 2
Reputation: 773
I got the correct answer like this:
def findCountry(stringText):
    # longest names first, so Nigeria is tested before Niger
    countries = sorted([country.name for country in pycountry.countries], key=lambda x: -len(x))
    for country in countries:
        if country.lower() in stringText.lower():
            return country
    return None
This follows @Amandan's solution in this question.
Upvotes: 0
Reputation: 520878
One regex approach would be to build an alternation containing all target countries to be found. Then, use re.findall on the input text to find any possible matches:
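import re
import pycountry
# pycountry.countries yields Country objects, so join their escaped names
regex = r'\b(?:' + '|'.join(re.escape(country.name) for country in pycountry.countries) + r')\b'
def findCountry(stringText):
    countries = re.findall(regex, stringText, flags=re.IGNORECASE)
    return countries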
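A self-contained sketch of the alternation idea, with a hypothetical SAMPLE_NAMES list standing in for the pycountry names so that it runs without the library. The helper name find_countries is illustrative. It shows that the \b word boundaries stop Niger from matching inside Nigeria; re.escape is applied in case a name contains regex metacharacters.

```python
import re

SAMPLE_NAMES = ["Niger", "Nigeria", "Norway"]  # hypothetical stand-in list

# Builds \b(?:Niger|Nigeria|Norway)\b with each name escaped
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in SAMPLE_NAMES) + r")\b",
    flags=re.IGNORECASE,
)

def find_countries(text):
    """Return every whole-word occurrence of a sample name, as written in text."""
    return PATTERN.findall(text)

found = find_countries("Nigeria: Hotspot Network LTD Rural Telephony Feasibility Study")
both = find_countries("A road from niger to Nigeria")
print(found)  # ['Nigeria']
print(both)   # ['niger', 'Nigeria']
```

Because findall returns the text as it appears in the input, the matches keep the input's casing; mapping them back to canonical names would need an extra lookup.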
Upvotes: 2
Reputation: 198304
Always search for longest strings first; this will prevent the kind of error you encountered.
countries = sorted((country.name for country in pycountry.countries), key=lambda x: -len(x))
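As a runnable sketch of the longest-first idea, the snippet below sorts a hypothetical SAMPLE_NAMES list (a stand-in for the pycountry names) by descending length before running the substring test. The helper name find_country is illustrative only.

```python
SAMPLE_NAMES = ["Niger", "Nigeria", "Norway"]  # hypothetical stand-in list

# Longest names first, so "Nigeria" is tested before its prefix "Niger"
BY_LENGTH = sorted(SAMPLE_NAMES, key=len, reverse=True)

def find_country(text, names=BY_LENGTH):
    """Return the first (that is, longest) name occurring in text, else None."""
    lowered = text.lower()
    for name in names:
        if name.lower() in lowered:
            return name
    return None

print(BY_LENGTH)  # ['Nigeria', 'Norway', 'Niger']
print(find_country("Nigeria: Hotspot Network LTD Rural Telephony Feasibility Study"))  # Nigeria
print(find_country("Niger river basin survey"))  # Niger
```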
Upvotes: 8