Reputation: 15
I am working in Python and have a list of country names that I would like to clean. Most of them are already written the way I want them. However, some have a one- or two-digit number attached, and others have text in parentheses appended. Here's a sample of that list:
Argentina
Australia1
Bolivia (Plurinational State of)
China, Hong Kong Special Administrative Region
Côte d'Ivoire
Curaçao
Guinea-Bissau
Indonesia8
The part that I want to capture would look like this:
Argentina
Australia
Bolivia
China, Hong Kong Special Administrative Region
Côte d'Ivoire
Curaçao
Guinea-Bissau
Indonesia
The best solution that I was able to come up with is ^[a-zA-Z\s,ô'ç-]+. However, for country names that are followed by text in parentheses, this leaves a trailing whitespace in the match.
This means I would like to match the entire country name unless there is a digit, or a whitespace followed by an opening parenthesis. In that case the match should stop before the digit, or before the whitespace and the (.
I know that I could probably solve this in two steps, but I am also reasonably sure that it should be possible to define a pattern that does it in one step. Since I am in the process of getting familiar with regex anyway, I thought this would be a nice thing to know.
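For reference, here is a minimal sketch that reproduces the trailing-whitespace problem and shows the two-step workaround mentioned above. The variable names and the use of str.rstrip for the second step are illustrative choices, not something prescribed by the question.

```python
import re

# The pattern from the question: letters, whitespace, and a few listed symbols.
pattern = re.compile(r"^[a-zA-Z\s,ô'ç-]+")

name = "Bolivia (Plurinational State of)"
matched = pattern.match(name).group()
print(repr(matched))  # the match still includes the space before "("

# Two-step workaround: match first, then strip the trailing whitespace.
stripped = matched.rstrip()
print(repr(stripped))
```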
Upvotes: 1
Views: 1055
Reputation: 110685
I suggest you simply convert the substrings you don't want to empty strings, using the regular expression
\d+$| +\(.*\)
with the multiline flag set, causing ^ and $ to respectively match the beginning and end of a line, rather than the beginning and end of the string.
The expression matches one or more digits at the end of a line or one or more spaces followed by a string that is enclosed in matching parentheses.
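As a rough sketch of how this could look in Python, assuming the whole list is held in one newline-separated string (the names junk, text, and cleaned are made up for the example):

```python
import re

# Pattern from this answer; re.MULTILINE makes $ match at the end of each line.
junk = re.compile(r"\d+$| +\(.*\)", re.MULTILINE)

text = """Argentina
Australia1
Bolivia (Plurinational State of)
China, Hong Kong Special Administrative Region
Côte d'Ivoire
Curaçao
Guinea-Bissau
Indonesia8"""

# Replace every match with the empty string, then split into one name per line.
cleaned = junk.sub("", text).splitlines()
print(cleaned)
```

Because . does not match a newline by default, the parenthesised part can never run past the end of its own line.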
Upvotes: 1
Reputation: 163362
The pattern can be written to match any character except digits, parentheses, or whitespace characters. That part can then optionally be repeated, each time preceded by a single space.
^[^\d\s()]+(?: [^\d\s()]+)*
^  Start of string
[^\d\s()]+  Match 1+ times any char except a digit, whitespace char or parenthesis, using a negated character class
(?:  Non-capturing group, to repeat as a whole part
[^\d\s()]+  Same match as above, here preceded by a literal space
)*  Close the non-capturing group and optionally repeat it
Upvotes: 1
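A small sketch of applying the negated-character-class pattern above to individual names. The helper clean and the fallback of returning the input unchanged when nothing matches are assumptions added for the example.

```python
import re

# Pattern from this answer: runs of "allowed" characters separated by single spaces.
pattern = re.compile(r"^[^\d\s()]+(?: [^\d\s()]+)*")

def clean(name):
    """Return the leading country name, or the input unchanged if nothing matches."""
    m = pattern.match(name)
    return m.group() if m else name

samples = [
    "Australia1",
    "Bolivia (Plurinational State of)",
    "China, Hong Kong Special Administrative Region",
    "Côte d'Ivoire",
]
cleaned = [clean(s) for s in samples]
print(cleaned)
```

Since a space is only consumed when an allowed character follows it, the match cannot end in whitespace.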
Reputation: 51
I think you can try ^([^\d \n]| +[^\d (\n])+
or, if you can guarantee your input doesn't contain double-spaces, the slightly simpler ^([^\d \n]| [^\d(\n])+
(A ^ at the start of [] negates the character class, so it matches any character except the ones listed; see https://regexone.com/lesson/excluding_characters)
Technically, the regex I've given omits trailing spaces, but for your application it doesn't sound like that would be a bad thing.
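To illustrate the double-space caveat, here is a quick comparison of the two variants. The second input string, with two spaces before the parenthesis, is a made-up example.

```python
import re

general = re.compile(r"^([^\d \n]| +[^\d (\n])+")  # tolerates runs of spaces
simpler = re.compile(r"^([^\d \n]| [^\d(\n])+")    # assumes single spaces only

single = "Bolivia (Plurinational State of)"
double = "Bolivia  (Plurinational State of)"  # hypothetical input with two spaces

# group(0) is the whole match; group(1) would only hold the last repetition.
general_results = [general.match(s).group(0) for s in (single, double)]
simpler_results = [simpler.match(s).group(0) for s in (single, double)]

print(general_results)
print(simpler_results)
```

With the double space, the simpler variant consumes the second space as an ordinary character and then carries on through the parenthesised text, which is why it needs the single-space guarantee.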
Upvotes: 0
Reputation: 4426
You can test the regex here https://regex101.com/r/dupn18/1
This should do the trick
In [1]: import re
In [2]: pattern = re.compile(r'(.+(?=\d| \()|.+)')
In [3]: data = """Argentina
...: Australia1
...: Bolivia (Plurinational State of)
...: China, Hong Kong Special Administrative Region
...: Côte d'Ivoire
...: Curaçao
...: Guinea-Bissau
...: Indonesia8""".splitlines()
In [4]: [pattern.search(country).group() for country in data]
Out[4]:
['Argentina',
'Australia',
'Bolivia',
'China, Hong Kong Special Administrative Region',
"Côte d'Ivoire",
'Curaçao',
'Guinea-Bissau',
'Indonesia']
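One thing to watch: the question mentions two-digit suffixes, and since .+ is greedy, the pattern above backtracks only as far as the last digit. A possible adjustment, offered as a sketch rather than a tested replacement, is a lazy quantifier with an end-of-string alternative in the lookahead. The name Australia12 is a made-up example.

```python
import re

greedy = re.compile(r"(.+(?=\d| \()|.+)")  # pattern from this answer
lazy = re.compile(r".+?(?=\d| \(|$)")      # hypothetical lazy variant

name = "Australia12"  # made-up name with a two-digit suffix

greedy_result = greedy.search(name).group()  # stops before the last digit only
lazy_result = lazy.search(name).group()      # stops before the first digit

print(greedy_result)
print(lazy_result)
```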
Upvotes: 0