abhinav dwivedi

Reputation: 198

Regex to replace spaces with underscores between words starting with a capital letter

With input like:

Roger Federer is a tennis player. Rafael Nadal Parera is also a tennis player. Another legend player is Novak Djokovic.

I am expecting an output like:

Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is Novak_Djokovic.

A solution I've tried, using a positive lookbehind with Python's re package, is:

re.sub(r"(?<=\w)\s([A-Z])", r"_\1", above_string)

But because \w matches any word character, the space after "is" is replaced too, and I get:

Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is_Novak_Djokovic.

Naturally, I can't use r"(?<=[A-Z]\w*)\s([A-Z])" instead, because the lookbehind is variable-width:

error: look-behind requires fixed-width pattern
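
Both failures described above can be reproduced in isolation. This is a minimal sketch, assuming only the standard-library re module; the variable names are illustrative and not part of the original question.

```python
import re

text = "Another legend player is Novak Djokovic."

# Fixed-width lookbehind: any word character before the space qualifies,
# so the space after lowercase "is" is replaced as well.
too_greedy = re.sub(r"(?<=\w)\s([A-Z])", r"_\1", text)

# Variable-width lookbehind: the standard re module refuses to compile it.
try:
    re.compile(r"(?<=[A-Z]\w*)\s([A-Z])")
    compiled = True
except re.error:
    compiled = False

print(too_greedy)
print(compiled)
```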

I have to apply this regex to a huge number of very diverse articles, so I can't afford a loop or a brute-force str.replace. I was wondering if anyone could please come up with an efficient solution.

Upvotes: 1

Views: 129

Answers (1)

Wiktor Stribiżew

Reputation: 626893

If you only need to handle ASCII uppercase letters (A-Z) rather than all Unicode uppercase letters, you can use

import re
above_string = "Roger Federer is a tennis player. Rafael Nadal Parera is also a tennis player. Another legend player is Novak Djokovic."
print(re.sub(r"\b([A-Z]\w*)\s+(?=[A-Z])", r"\1_", above_string))
# => Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is Novak_Djokovic.

Pattern details:

  • \b - a word boundary
  • ([A-Z]\w*) - Group 1 (\1): an uppercase letter and zero or more word chars
  • \s+ - one or more whitespace characters
  • (?=[A-Z]) - a positive lookahead that matches a location immediately followed by an uppercase letter.
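
Since the question mentions a large number of articles, the same pattern can be compiled once and reused. The following is only a sketch of that idea: the names CAPITALIZED_RUN, join_capitalized and articles are hypothetical, and re.VERBOSE is used purely so that the parts listed above can be annotated inline.

```python
import re

# Compiled once and reused for every article; [A-Z] covers ASCII capitals only.
CAPITALIZED_RUN = re.compile(
    r"""
    \b            # word boundary
    ([A-Z]\w*)    # group 1: an ASCII uppercase letter, then word characters
    \s+           # one or more whitespace characters
    (?=[A-Z])     # lookahead: the next character is an ASCII uppercase letter
    """,
    re.VERBOSE,
)

def join_capitalized(text):
    """Replace whitespace between consecutive capitalized words with '_'."""
    return CAPITALIZED_RUN.sub(r"\1_", text)

# Hypothetical sample data standing in for the real articles.
articles = [
    "Roger Federer is a tennis player.",
    "Rafael Nadal Parera is also a tennis player.",
    "Another legend player is Novak Djokovic.",
]
joined = [join_capitalized(article) for article in articles]
print(joined)
```

Each article is handled by a single sub call, so the only explicit loop is the one over the articles themselves.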

If you need to support all Unicode uppercase letters, it is advisable to install the third-party regex package (pip install regex) and use

import regex
above_string = "Roger Federer is a tennis player. Rafael Nadal Parera is also a tennis player. Another legend player is Novak Djokovic."
print(regex.sub(r"\b(\p{Lu}\w*)\s+(?=\p{Lu})", r"\1_", above_string))
# => Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is Novak_Djokovic.

Here, \p{Lu} matches any Unicode uppercase letter.
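
If installing a third-party package is not an option, a rough standard-library approximation is to match every gap between words and decide in a replacement callback. This is a sketch under stated assumptions, not the approach above: the names WORD_GAP and join_capitalized_unicode are hypothetical, and str.isupper() is used as a stand-in for \p{Lu}, which it approximates but does not exactly reproduce. The callback runs once per word gap, so it is likely slower than a pure pattern on large inputs.

```python
import re

# Candidate match: a word, whitespace, then a lookahead capturing the next character.
WORD_GAP = re.compile(r"\b(\w+)\s+(?=(\w))")

def _join_if_capitalized(match):
    word, next_char = match.group(1), match.group(2)
    # Join only when both the word and the following word start in uppercase.
    if word[0].isupper() and next_char.isupper():
        return word + "_"
    return match.group(0)  # leave the gap unchanged

def join_capitalized_unicode(text):
    return WORD_GAP.sub(_join_if_capitalized, text)

result = join_capitalized_unicode("Émile Zola and Łukasz Kubot are names.")
print(result)
```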

Upvotes: 1
