Reputation: 198
With input like:
Roger Federer is a tennis player. Rafael Nadal Parera is also a tennis player. Another legend player is Novak Djokovic.
I am expecting an output like:
Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is Novak_Djokovic.
A solution I've tried using positive lookbehind (using Python re
package) is:
re.sub(r"(?<=\w)\s([A-Z])", r"_\1", above_string)
But here, because of \w
, I get an output:
Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is_Novak_Djokovic.
Naturally, I can't make it work using r"(?<=[A-Z]\w*)\s([A-Z])"
, because
error: look-behind requires fixed-width pattern
I have to apply this regex on huge number of (and much diverse) articles so I can't afford any loop or a str.replace
bruteforce. I was wondering if anyone could please come with with an efficient solution.
Upvotes: 1
Views: 129
Reputation: 626893
If you do not care about all Unicode uppercase letters, you can use
import re
above_string = "Roger Federer is a tennis player. Rafael Nadal Parera is also a tennis player. Another legend player is Novak Djokovic."
print( re.sub(r"\b([A-Z]\w*)\s+(?=[A-Z])", r"\1_", above_string) )
# => Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is Novak_Djokovic.
See the Python demo. See the regex demo. Details:
\b
- a word boundary([A-Z]\w*)
- Group 1 (\1
): an uppercase letter and zero or more word chars\s+
- one or more whitespaces(?=[A-Z])
- a positive lookahead that matches a location immediately followed with an uppercase letter.If you need to support all Unicode letters, it is advisable to pip install regex
and use
import regex
above_string = "Roger Federer is a tennis player. Rafael Nadal Parera is also a tennis player. Another legend player is Novak Djokovic."
print( regex.sub(r"\b(\p{Lu}\w*)\s+(?=\p{Lu})", r"\1_", above_string) )
# => Roger_Federer is a tennis player. Rafael_Nadal_Parera is also a tennis player. Another legend player is Novak_Djokovic.
See this Python demo. Here, \p{Lu}
matches any Unicode uppercase letter.
Upvotes: 1