Abdul

Reputation: 1486

Regex to separate nationality from text

I have a list of people's nationalities, one per entry. Most entries are clean, but some are not. The proper ones look like this:

German
Iranian
Qatar

The improper ones look like this:

Possibly Ethiopian

Lebanon citizenship

DRC and Belgian nationalities

(1) Germany (b) Algeria

(a) Russian (b) Georgia

a) French, b) Tunisian

Indonesian (as at December 2003)

Iranian (Iranian citizenship)

Sudanese by birth

(1) Russian (2) USSR (until 1991)

Bahrain (citizenship revoked in January 2015)

United States of America. Also believed to hold Syrian nationality

Tunisian (dual nationality)

(1) German (2) Moroccan

1) Saudi Arabia 2) Qatar

a) Central African Republic b) South Sudan

Iranian national and US national/citizen

Kuwaiti citizenship withdrawn in 2002

I need to extract only the nationalities (the bold text) from each entry. A nationality can be from any country; these are just samples.

How can I apply a regex, or any other kind of condition, to get the expected results? I have tried checking whether the text contains certain characters and then splitting on them, but that would need more than 20 conditions, which is not a good approach.

// listOfNationalities is the existing List<string> holding the raw entries above.
List<string> multiple = new List<string>();
foreach (var nationality in listOfNationalities)
{
    if(nationality.Contains("(1)"))
    {
        string[] nat = nationality.Split(')'); 
        foreach (var item in nat)
        {
            multiple.Add(item);
        }
    }
}

Upvotes: 0

Views: 633

Answers (2)

Pedro Fernandes

Reputation: 376

If the nationalities come from a fixed list of available options, you can do the following:

// listOfNationalities is the existing List<string> of raw entries.
List<string> validNationalities = new List<string>();
validNationalities.Add("Brazilian");
validNationalities.Add("Japanese");
validNationalities.Add("...");

// Keeps only the entries that exactly equal a valid nationality.
List<string> multiple = listOfNationalities.Where(n => validNationalities.Contains(n)).ToList();

or even simpler:

// Join all raw entries into one string, then look for each valid nationality inside it.
string joinedEntries = string.Join("|", listOfNationalities);

List<string> validNationalities = new List<string>();
validNationalities.Add("Brazilian");
validNationalities.Add("Japanese");
validNationalities.Add("...");

List<string> multiple = validNationalities.Where(n => joinedEntries.Contains(n)).ToList();

This way, you get every valid nationality that occurs anywhere in the entries, including both nationalities when an entry mentions two.
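The joined-string variant no longer tells you which entry each nationality came from. If you need that, a minimal sketch of a per-entry lookup could look like the following. `NationalityLookup` and `FindIn` are hypothetical names introduced here for illustration, and the case-insensitive comparison is an assumption about the data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NationalityLookup
{
    // Hypothetical helper: returns the valid nationalities that occur in one entry,
    // in the order they appear in validNationalities.
    public static List<string> FindIn(string entry, IEnumerable<string> validNationalities)
    {
        return validNationalities
            // Plain substring search, ignoring case; no word boundaries are checked.
            .Where(n => entry.IndexOf(n, StringComparison.OrdinalIgnoreCase) >= 0)
            .ToList();
    }
}
```

Because this is a plain substring search, a short name can match inside a longer word (for example "Oman" inside "Romanian"), so it is only safe when the valid list cannot collide like that.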

Upvotes: 2

41686d6564

Reputation: 19641

If you already have a list of valid nationalities, and if the nationalities don't include special characters, you can use something like the following to create the regex pattern at runtime:

public string NationalitiesPattern;

public string GetNationalitiesPattern()
{
    List<string> listOfNationalities = // All valid nationalities.
    string joinedNationalities = string.Join("|", listOfNationalities);
    return $@"\b(?:{joinedNationalities})\b";       // "\b(?:German|Iranian|Qatar|etc)\b"
}

And then you can use it like this:

if (string.IsNullOrEmpty(NationalitiesPattern))
    NationalitiesPattern = GetNationalitiesPattern();

MatchCollection matches = Regex.Matches(inputString, NationalitiesPattern);
foreach (Match m in matches)
    Console.WriteLine(m.Value);
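If the names might contain special characters or spaces, a variation of the same idea is to escape each name with Regex.Escape and to put longer names first in the alternation. The sketch below is self-contained, but the class name `NationalityExtractor`, the method `Extract`, and the short sample list are assumptions for illustration only; a real list would need every country and demonym you expect.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class NationalityExtractor
{
    // Assumption: a small sample list, not a complete set of nationalities.
    private static readonly string[] KnownNationalities =
    {
        "German", "Germany", "Moroccan", "Russian", "USSR", "Georgia",
        "Central African Republic", "South Sudan", "Sudanese", "Iranian", "US"
    };

    // Longer names are listed first so multi-word names are tried before shorter ones.
    // Regex.Escape makes characters such as '.', '(' and spaces match literally.
    private static readonly Regex Pattern = new Regex(
        @"\b(?:" + string.Join("|", KnownNationalities
            .OrderByDescending(n => n.Length)
            .Select(n => Regex.Escape(n))) + @")\b",
        RegexOptions.Compiled);

    // Returns the distinct nationalities found in one entry, in order of first appearance.
    public static List<string> Extract(string entry)
    {
        return Pattern.Matches(entry)
            .Cast<Match>()
            .Select(m => m.Value)
            .Distinct()
            .ToList();
    }
}
```

Matching here is case-sensitive, so "US" is not found inside "Russian", and the word boundaries stop it from matching inside "USSR". An entry such as "Iranian (Iranian citizenship)" yields a single "Iranian" because of the Distinct call.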

Upvotes: 0
