Secco

Reputation: 15

Regex - Match a string up to a digit or a specific string

I am working in Python and have a list of country names that I would like to clean. Most are already written the way I want them, but some have a one- or two-digit number attached, and others have text in parentheses appended. Here's a sample of that list:

Argentina
Australia1
Bolivia (Plurinational State of)
China, Hong Kong Special Administrative Region
Côte d'Ivoire
Curaçao
Guinea-Bissau
Indonesia8

The part that I want to capture would look like this:

Argentina
Australia
Bolivia
China, Hong Kong Special Administrative Region
Côte d'Ivoire
Curaçao
Guinea-Bissau
Indonesia

The best solution that I was able to come up with is ^[a-zA-Z\s,ô'ç-]+. However, this leaves country names that are followed by a text in parentheses with a trailing white space.
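To make the problem concrete, here is a minimal sketch of what that pattern returns on two of the sample entries. The variable names are only illustrative.

```python
import re

# The pattern from the question; the trailing "-" inside the class is a literal hyphen.
attempt = re.compile(r"^[a-zA-Z\s,ô'ç-]+")

samples = ["Australia1", "Bolivia (Plurinational State of)"]
matches = [attempt.match(s).group() for s in samples]

# repr() makes the trailing space after "Bolivia" visible.
print([repr(m) for m in matches])
```

The digit is dropped as intended, but because \s is part of the character class, the space before the opening parenthesis is still matched.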

In other words, I would like to match the entire country name, but stop before the first digit or before a whitespace character that is followed by an opening parenthesis.

I know that I could probably solve this in two steps but I am also reasonably sure that it should be possible to define a pattern that can do it in one step. Since I am anyway in the process of getting familiar with regex, I thought this would be a nice thing to know.

Upvotes: 1

Views: 1055

Answers (4)

Cary Swoveland

Reputation: 110685

I suggest you simply convert the strings you don't want to empty strings, using the regular expression

\d+$| +\(.*\)

with the multiline flag set, so that $ matches at the end of each line rather than only at the end of the string (the flag changes ^ in the same way, although this expression does not use it).

Demo

The expression matches one or more digits at the end of a line or one or more spaces followed by a string that is enclosed in matching parentheses.
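In Python this could look roughly like the following sketch, which assumes the whole list is held in one newline-separated string (the variable names are made up for illustration):

```python
import re

raw = """Argentina
Australia1
Bolivia (Plurinational State of)
Indonesia8"""

# re.MULTILINE lets $ match at the end of every line, not only at the end of the string.
unwanted = re.compile(r"\d+$| +\(.*\)", flags=re.MULTILINE)

# Replace every unwanted suffix with an empty string, then split into lines.
cleaned = unwanted.sub("", raw).splitlines()
print(cleaned)  # ['Argentina', 'Australia', 'Bolivia', 'Indonesia']
```

If the names are already in a Python list, the same pattern can be applied to each element separately, in which case the multiline flag is not needed.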

Upvotes: 1

The fourth bird

Reputation: 163362

The pattern can be written as matching any char except digits, parentheses or whitespace chars. That part can then be optionally repeated, each time preceded by a single space.

^[^\d\s()]+(?: [^\d\s()]+)*
  • ^ Start of string
  • [^\d\s()]+ Match 1+ times any char except a digit, whitespace char or parenthesis, using a negated character class
  • (?: Non-capturing group, repeated as a whole
    •  [^\d\s()]+ Match a single space, then the same as above
  • )* Close the non-capturing group and repeat it 0 or more times
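A short sketch of how the pattern might be applied in Python, using a few entries from the question (the variable names are illustrative only):

```python
import re

name = re.compile(r"^[^\d\s()]+(?: [^\d\s()]+)*")

countries = [
    "Australia1",
    "Bolivia (Plurinational State of)",
    "China, Hong Kong Special Administrative Region",
    "Côte d'Ivoire",
]

# match() anchors at the start of the string, so the leading ^ is redundant here but harmless.
cleaned = [name.match(c).group() for c in countries]
print(cleaned)
```

Note that match() returns None for an empty string or one that starts with a digit, space or parenthesis, so real data may need a guard before calling group().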

Regex demo

Upvotes: 1

Tiankai Liu

Reputation: 51

I think you can try ^([^\d \n]| +[^\d (\n])+ or, if you can guarantee your input doesn't contain double spaces, the slightly simpler ^([^\d \n]| [^\d(\n])+ (A ^ as the first character inside [] negates the class, so it matches any character except the ones listed; see https://regexone.com/lesson/excluding_characters)

Technically, the regex I've given omits trailing spaces, but for your application it doesn't sound like that would be a bad thing.
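As a rough illustration, here are both variants in Python. The names `strict` and `simple` and the test strings with extra spaces are made up for this sketch.

```python
import re

strict = re.compile(r"^([^\d \n]| +[^\d (\n])+")  # tolerates runs of spaces
simple = re.compile(r"^([^\d \n]| [^\d(\n])+")    # assumes no double spaces

samples = ["Australia1", "Bolivia (Plurinational State of)", "Argentina  "]

# group(0) is the whole match; group(1) would only hold the last repetition.
strict_out = [strict.match(s).group(0) for s in samples]
print(strict_out)  # ['Australia', 'Bolivia', 'Argentina']

# With two spaces before the parenthesis, the simpler variant no longer stops in time.
double_space = "Bolivia  (Plurinational State of)"
print(strict.match(double_space).group(0))  # 'Bolivia'
print(simple.match(double_space).group(0))  # the whole string
```

The last two lines show why the double-space caveat matters: in the simpler pattern the second space is consumed as an ordinary character, after which the parenthesis is accepted too.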

Upvotes: 0

Ron Serruya

Reputation: 4426

This should do the trick. You can test the regex here: https://regex101.com/r/dupn18/1

In [1]: import re

In [2]: pattern = re.compile(r'(.+(?=\d| \()|.+)')

In [3]: data = """Argentina
   ...: Australia1
   ...: Bolivia (Plurinational State of)
   ...: China, Hong Kong Special Administrative Region
   ...: Côte d'Ivoire
   ...: Curaçao
   ...: Guinea-Bissau
   ...: Indonesia8""".splitlines()

In [4]: [pattern.search(country).group() for country in data]
Out[4]:
['Argentina',
 'Australia',
 'Bolivia',
 'China, Hong Kong Special Administrative Region',
 "Côte d'Ivoire",
 'Curaçao',
 'Guinea-Bissau',
 'Indonesia']
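One caveat worth checking: the question mentions two-digit suffixes, and because .+ is greedy it backs off only as far as the last position that is followed by a digit. The sketch below compares the pattern above with a lazy variant; the lazy pattern is an assumption added here for comparison, not part of the original answer.

```python
import re

greedy = re.compile(r"(.+(?=\d| \()|.+)")  # pattern from the session above
lazy = re.compile(r".+?(?=\d| \(|$)")      # assumed alternative: stop at the first digit, " (" or end

samples = ["Australia12", "Bolivia (Plurinational State of)", "Argentina"]

greedy_out = [greedy.search(s).group() for s in samples]
lazy_out = [lazy.search(s).group() for s in samples]

print(greedy_out)  # ['Australia1', 'Bolivia', 'Argentina']
print(lazy_out)    # ['Australia', 'Bolivia', 'Argentina']
```

On the sample data from the question both patterns give the same result; they differ only when more than one digit is attached.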

Upvotes: 0
