Talib Daryabi

Reputation: 773

How to extract a country name from a string in Python

I have some text that may or may not contain a country name. For example:

' Nigeria: Hotspot Network LTD Rural Telephony Feasibility Study'

This is how I tried to extract the country name. My first attempt:

import pycountry

def findCountry(stringText):
    # Return the first country whose name occurs anywhere in the text
    for country in pycountry.countries:
        if country.name.lower() in stringText.lower():
            return country.name
    return None

findCountry("Nigeria: Hotspot Network LTD Rural Telephony Feasibility Study")

Unfortunately, this returns Niger, whereas the correct answer is Nigeria. Note that Niger and Nigeria are two different countries.

My second attempt:

def findCountry(stringText):
    # Collect the name of every country that occurs in the text
    full_list = []
    for country in pycountry.countries:
        if country.name.lower() in stringText.lower():
            full_list.append(country.name)

    if len(full_list) > 0:
        return full_list

    return None

This returns ['Niger', 'Nigeria'], but I can't find a way to get just Nigeria as the final output. How can I achieve this?

Note: here I know that Nigeria is the correct answer, but later the code will have to choose the final country name on its own whenever one is present in the text, and detection needs to be highly accurate.

Upvotes: 5

Views: 2981

Answers (4)

moshfiqrony

Reputation: 4723

The problem is that the in operator tests for substring occurrence, so "niger" is found inside "nigeria". You could swap the operands on either side of in, but that would fix Nigeria and not other cases. You can use == instead, which handles every case where the string is exactly the country name.

def findCountry(stringText):
    for country in pycountry.countries:
        if country.name.lower() == stringText.lower():
            return country.name
    return None
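
To make the difference concrete, here is a minimal sketch using the sample text from the question. It does not depend on pycountry; the variable names are only illustrative. It shows that in matches a substring, while == only succeeds when the text is nothing but the country name.

```python
text = "Nigeria: Hotspot Network LTD Rural Telephony Feasibility Study"

# `in` on strings is a substring test, so "niger" is found inside "nigeria".
substring_hit = "niger" in text.lower()

# `==` compares the whole string, so it only succeeds when the text is
# nothing but the country name.
equality_on_sentence = "nigeria" == text.lower()
equality_on_bare_name = "nigeria" == "Nigeria".lower()

print(substring_hit, equality_on_sentence, equality_on_bare_name)
# prints: True False True
```

So == avoids the Niger/Nigeria confusion, but on the full sentence from the question it would find no country at all.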

Upvotes: 2

Talib Daryabi

Reputation: 773

I got the correct answer like this:

def findCountry(stringText):
    countries = sorted((country.name for country in pycountry.countries), key=len, reverse=True)
    for country in countries:
        if country.lower() in stringText.lower():
            return country
    return None

This follows @Amadan's answer to this question.

Upvotes: 0

Tim Biegeleisen

Reputation: 520878

One regex approach would be to build an alternation containing all target countries to be found. Then, use re.findall on the input text to find any possible matches:

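import re
import pycountry

# Join the country names (not the Country objects) and escape regex metacharacters
names = '|'.join(re.escape(country.name) for country in pycountry.countries)
regex = r'\b(?:' + names + r')\b'

def findCountry(stringText):
    countries = re.findall(regex, stringText, flags=re.IGNORECASE)
    return countries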
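
As a rough, self-contained illustration of the word-boundary idea, the sketch below uses a small hypothetical list of names (SAMPLE_NAMES) in place of pycountry, and a hypothetical helper find_countries. Because \b must hold after the match, "Niger" cannot match inside "Nigeria", and the alternation falls through to the longer name.

```python
import re

# Hypothetical stand-in for the names pycountry would supply.
SAMPLE_NAMES = ["Niger", "Nigeria", "Chad", "Mali"]

# \b on both sides forces whole-word matches; re.escape guards against
# names that contain regex metacharacters.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in SAMPLE_NAMES) + r")\b",
    flags=re.IGNORECASE,
)

def find_countries(text):
    """Return every whole-word country name found in text, in order."""
    return PATTERN.findall(text)

print(find_countries("Nigeria: Hotspot Network LTD Rural Telephony Feasibility Study"))
# prints: ['Nigeria']
```

With this pattern a word such as "Somaliland" yields no match for "Mali", since the name is not delimited by word boundaries there.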

Upvotes: 2

Amadan

Reputation: 198304

Always search for longest strings first; this will prevent the kind of error you encountered.

countries = sorted(pycountry.countries, key=lambda country: -len(country.name))
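
A minimal sketch of the longest-first idea, assuming a small hypothetical list of names (SAMPLE_NAMES) instead of pycountry and a hypothetical helper find_country. Longer names are tried before the shorter names they contain, so the first hit is the most specific one.

```python
# Hypothetical stand-in for the names pycountry would supply.
SAMPLE_NAMES = ["Niger", "Nigeria", "Guinea", "Guinea-Bissau"]

# Longest names first, so "Nigeria" is tested before "Niger".
LONGEST_FIRST = sorted(SAMPLE_NAMES, key=len, reverse=True)

def find_country(text):
    """Return the longest sample name that occurs in text, or None."""
    lowered = text.lower()
    for name in LONGEST_FIRST:
        if name.lower() in lowered:
            return name
    return None

print(find_country("Nigeria: Hotspot Network LTD Rural Telephony Feasibility Study"))
# prints: Nigeria
```

This is still a substring test, so a word like "Nigerian" would also return Nigeria; combining the ordering with word boundaries is stricter if that matters.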

Upvotes: 8
