lxyu

Reputation: 2839

String conversion with Python re

I have a string, line:

>>> line = "  abc\n  def\n\n  ghi\n  jkl"
>>> print line
  abc
  def

  ghi
  jkl

and I want to convert it to "  abcdef\n\n  ghijkl", like:

>>> print "  abcdef\n\n  ghijkl"
  abcdef

  ghijkl

I tried the Python re module and wrote something like this:

re.sub('(?P<word1>[^\n\s])\n\s*(?P<word2>[^\n\s])', '\g<word1>\g<word2>', line)

but I get this:

>>> re.sub('(?P<word1>[^\n\s])\n\s*(?P<word2>[^\n\s])', '\g<word1>\g<word2>', line)
Out: '  abcdefghijkl'

It seems to me that the \n\s* part is also matching \n\n. Can anyone point out where I went wrong?

Upvotes: 2

Views: 663

Answers (3)

ekhumoro

Reputation: 120618

You could simplify the regexp if you used \S, which matches any non-whitespace character:

>>> import re
>>> line = "  abc\n  def\n\n  ghi\n  jkl"
>>> print re.sub(r'(\S+)\n\s*(\S+)', r'\1\2', line)
  abcdef

  ghijkl

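However, the reason your own regexp doesn't work is that your <word1> and <word2> groups each match only a single character (i.e. they're not using +). After the first match consumes just the d, the f is still free to start another match, and \n\s* then runs across the blank line to the g. With that simple correction, your regexp produces the correct output for this input: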

>>> print re.sub(r'(?P<word1>[^\n\s]+)\n\s*(?P<word2>[^\n\s]+)', r'\g<word1>\g<word2>', line)
  abcdef

  ghijkl
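
To make the difference visible, here is a small sketch that lists what each pattern actually matches on the question's input. The variable names (single, plus, three) and the three-line test string are my own additions for illustration, not part of the question. It is written with print(...) so it should run on both Python 2 and 3.

```python
import re

line = "  abc\n  def\n\n  ghi\n  jkl"

# The asker's pattern: each group captures exactly one character.
single = r'(?P<word1>[^\n\s])\n\s*(?P<word2>[^\n\s])'
# The corrected pattern: each group captures a whole word.
plus = r'(?P<word1>[^\n\s]+)\n\s*(?P<word2>[^\n\s]+)'

single_matches = [m.group(0) for m in re.finditer(single, line)]
plus_matches = [m.group(0) for m in re.finditer(plus, line)]

print(single_matches)  # the middle match starts at 'f' and spans the blank line
print(plus_matches)    # 'def' is already consumed, so no match can start there

# Hypothetical extra input: a paragraph of three lines instead of two.
three = "a\nb\nc\n\nd"
three_result = re.sub(plus, r'\g<word1>\g<word2>', three)
print(repr(three_result))
```

Note that on the three-line input the result is 'ab\ncd', so the blank line is lost again. The + version relies on each match consuming the word just before the blank line, which holds for the two-line paragraphs in the question but not for every input, because \s* can still match newlines.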

Upvotes: 0

Brigand

Reputation: 86240

Try this:

import re
line = "  abc\n  def\n\n  ghi\n  jkl"
print re.sub(r'\n(?!\n)\s*', '', line)

It gives:

  abcdef
ghijkl

It says, "Replace a newline that is NOT followed by another newline, together with any whitespace after it, with nothing."

UPDATE: Here's a better version:

>>> re.sub(r'([^\n])\n(?!\n)\s*', r'\1', line)
'  abcdef\n\n  ghijkl'

It gives exactly the output asked for in the question.
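
For anyone wondering why the first version collapses the blank line while the update keeps it, here is a side-by-side sketch. The names first and better are just labels I picked for the two patterns above.

```python
import re

line = "  abc\n  def\n\n  ghi\n  jkl"

# First version: only checks what FOLLOWS the newline.
first = re.sub(r'\n(?!\n)\s*', '', line)
# Updated version: also requires a non-newline character BEFORE it.
better = re.sub(r'([^\n])\n(?!\n)\s*', r'\1', line)

print(repr(first))   # second newline of the pair is removed
print(repr(better))  # both newlines of the pair survive
```

In the first version the second newline of the \n\n pair is followed by a space, so the lookahead lets it match and it is deleted along with the indentation. Requiring [^\n] in front means a newline that directly follows another newline can never start a match.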

Upvotes: 0

Tim Pietzcker

Reputation: 336198

\s matches space, \t, \n and (depending on your regex engine) a few other whitespace characters. That is why \n\s* in your pattern also matches \n\n.
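
A quick way to check this in Python's re module (a minimal sketch; the variable names are only for illustration):

```python
import re

# \s covers the newline as well as spaces and tabs.
whitespace_hits = re.findall(r'\s', " \t\n")
print(whitespace_hits)

# So \n\s* can swallow a second newline and the indentation after it.
bridged = re.match(r'\n\s*', "\n\n  ghi").group(0)
print(repr(bridged))

# Restricting the class to [ \t] stops right after the first newline.
limited = re.match(r'\n[ \t]*', "\n\n  ghi").group(0)
print(repr(limited))
```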

So if you only want to replace single linebreaks + spaces/tabs, you can use this:

newline = re.sub(r"(?<!\n)\n[ \t]*(?!\n)", "", line)

Explanation:

(?<!\n) # Assert that the previous character isn't a newline
\n      # Match a newline
[ \t]*  # Match any number of spaces/tabs
(?!\n)  # Assert that the next character isn't a newline

In Python:

>>> import re
>>> line = "  abc\n  def\n\n  ghi\n  jkl"
>>> newline = re.sub(r"(?<!\n)\n[ \t]*(?!\n)", "", line)
>>> print newline
  abcdef

  ghijkl
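
If you want to keep the commented breakdown next to the pattern in real code, one option is re.VERBOSE, which ignores whitespace and # comments inside the pattern (except inside a character class, so [ \t] still works). This is only a sketch; JOIN_WRAPPED and unwrap are hypothetical names I made up for the example.

```python
import re

# Hypothetical name: the same pattern as above, written in verbose mode.
JOIN_WRAPPED = re.compile(r"""
    (?<!\n)   # previous character is not a newline
    \n        # the single line break to remove
    [ \t]*    # indentation of the continuation line
    (?!\n)    # next character is not a newline
""", re.VERBOSE)


def unwrap(text):
    """Join lines separated by a single line break; keep blank lines."""
    return JOIN_WRAPPED.sub("", text)


line = "  abc\n  def\n\n  ghi\n  jkl"
result = unwrap(line)
print(repr(result))
```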

Upvotes: 4
