aerin

Reputation: 22634

Python regex - replace newline (\n) with something else

I'm trying to replace multiple consecutive newline characters followed by a capital letter with "____" so that I can parse the text afterwards.

For example,

import re
i = "Inc\n\nContact"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)

In [25]: i
Out [25]: 'Inc____Contact'

This string works fine; I can split on ____ later.

However, it doesn't work on this particular string:

i =  "(2 months)\n\nML"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)

Out [31]: '(2 months)____L'

It ate the capital M. What am I missing here?

Upvotes: 3

Views: 16586

Answers (3)

Quinn

Reputation: 4504

EDIT: To replace multiple consecutive newline characters (\n) with ____, this should work:

>>> import re
>>> i =  "(2 months)\n\nML"
>>> re.sub(r'(\n+)(?=[A-Z])', r'____', i)
'(2 months)____ML'

(?=[A-Z]) is a positive lookahead: it asserts that the newline characters are followed by a capital letter without consuming that letter, so the letter stays in the string.
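As a rough sketch of how the lookahead behaves, the snippet below runs the same pattern on one string where the newlines are followed by a capital letter and one where they are not. The helper name `mark_breaks` and the second sample string are made up for illustration.

```python
import re

# Hypothetical helper: replace each run of newlines with "____",
# but only when the run is immediately followed by an uppercase ASCII letter.
def mark_breaks(text):
    return re.sub(r'\n+(?=[A-Z])', '____', text)

followed_by_capital = mark_breaks("(2 months)\n\nML")
followed_by_lowercase = mark_breaks("first line\n\nsecond line")

print(repr(followed_by_capital))    # '(2 months)____ML'
print(repr(followed_by_lowercase))  # 'first line\n\nsecond line'
```

Because the lookahead matches nothing itself, no backreference is needed in the replacement string.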

Upvotes: 6

Sebastian Proske

Reputation: 8413

Well, let's take a look at your regex ([\n]+)([A-Z])+. The first part, ([\n]+), is fine: it matches one or more newlines into one group (note that this won't match the carriage return \r). The second part, ([A-Z])+, leads to your error. It matches a single uppercase letter into a capturing group, and the + repeats that group. If several uppercase letters follow, all of them are consumed by the match, but the group is overwritten on each repetition and ends up holding only the last letter, which is what \2 then inserts in the replacement.

Try the following and see what happens:

import re    
i = "Inc\n\nABRAXAS"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
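A small sketch of what that experiment shows, using re.search to inspect the groups directly (the variable names are only illustrative):

```python
import re

text = "Inc\n\nABRAXAS"
pattern = r'([\n]+)([A-Z])+'

# The whole match swallows every uppercase letter...
m = re.search(pattern, text)
whole_match = m.group(0)
# ...but the repeated group keeps only its last iteration.
last_letter = m.group(2)

replaced = re.sub(pattern, r"____\2", text)

print(repr(whole_match))  # '\n\nABRAXAS'
print(repr(last_letter))  # 'S'
print(repr(replaced))     # 'Inc____S'
```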

You could simply place the + inside the capturing group, i.e. ([A-Z]+), so that all the uppercase letters are captured. You could also just leave the + out, since it makes no difference how many uppercase letters follow:

import re    
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)([A-Z])', r"____\2", i)
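For comparison, a sketch of both fixes side by side on the same sample string; the names `plus_inside` and `plus_dropped` are made up for this example.

```python
import re

text = "Inc\n\nABRAXAS"

# + inside the group: group 2 captures the whole run of capitals.
plus_inside = re.sub(r'(\n+)([A-Z]+)', r"____\2", text)

# No + at all: only the first capital is matched and put back;
# the remaining letters are never touched.
plus_dropped = re.sub(r'(\n+)([A-Z])', r"____\2", text)

print(plus_inside)   # Inc____ABRAXAS
print(plus_dropped)  # Inc____ABRAXAS
```

Both variants give the same result here, so the choice is mostly a matter of taste.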

If you want to replace any sequence of line breaks, no matter what follows, drop the ([A-Z]) completely and try:

import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)', r"____", i)

You could also use ([\r\n]+) as the pattern if you want to handle carriage returns as well.
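To illustrate the carriage-return point, here is a sketch with a made-up string that uses Windows-style \r\n line endings:

```python
import re

windows_text = "Inc\r\n\r\nContact"  # hypothetical sample

# \n+ alone leaves the \r characters behind and splits the run in two.
newline_only = re.sub(r'\n+', "____", windows_text)

# [\r\n]+ treats the whole \r\n\r\n run as one sequence.
with_cr = re.sub(r'[\r\n]+', "____", windows_text)

print(repr(newline_only))  # 'Inc\r____\r____Contact'
print(repr(with_cr))       # 'Inc____Contact'
```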

Upvotes: 1

Saleem

Reputation: 8978

Try:

import re
# Match one line break at a time: an optional \r followed by \n.
p = re.compile(r'\r?\n')
test_str = "(2 months)\n\nML"
subst = "_"

result = re.sub(p, subst, test_str)

This reduces the string to the following, with one underscore per line break:

(2 months)__ML


Upvotes: 0
