bers

Reputation: 5771

How to completely remove a whole line with multiline regular expressions?

I want to remove all lines that include a b in this multiline string:

aba\n
aaa\n
aba\n
aaa\n
aba[\n\n - optional]

Note the file is not necessarily terminated by a newline character, or may have extra line breaks at the end that I want to keep.

This is the expected output:

aaa\n
aaa[\n\n - as in the input file]

This is what I have tried:

import re
String = "aba\naaa\naba\naaa\naba"
print(String)
print(re.sub(".*b.*", "", String))  # this one leaves three empty lines
print(re.sub(".*b.*\n", "", String))  # this one misses the last line
print(re.sub("\n.*b.*", "", String))  # this one misses the first line
print(re.sub(".*b.*\n?", "", String))  # this one leaves an empty last line
print(re.sub("\n?.*b.*", "", String))  # this one leaves an empty first line
print(re.sub("\n?.*b.*\n?", "", String))  # this one joins the two remaining lines

I have also tried flags=re.M and various lookaheads and lookbehinds, but the main question seems to be: how can I remove either the first or the last occurrence of \n in a matching string, depending on which one exists, but not both if both exist?

Upvotes: 3

Views: 3081

Answers (2)

Alain T.

Reputation: 42143

There are three cases to take into account in your re.sub() call to remove lines with a b in them:

  1. patterns followed by an end of line character (eol)
  2. the last line in the text (without a trailing eol)
  3. when there is only one line with no trailing eol

In that second case, you want to remove the preceding eol character to avoid creating an empty line. The third case will produce an empty string if there is a "b".

Because regex matches cannot overlap, greedy matching introduces a fourth case. If your last line contains a "b" and the line before it also contains a "b", case #1 will have consumed the eol character of the previous line, so that eol is no longer available to match case #2 on the last line (i.e., eol followed by the pattern at the end of the text). This can be addressed by clearing consecutive matching lines as a group (case #1) and including the last line as an optional component of that group. Whatever this leaves out will be trailing lines (case #2), where you want to remove the preceding eol rather than the following one.
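The overlap problem can be reproduced with a small sketch. The naive pattern below is a hypothetical one written only for illustration (case #1 or case #2, without grouping), and the grouped pattern is a non-capturing rendering of the idea described above; the input string is also made up for this example.

```python
import re

# Illustrative input: the last two lines both contain "b", no trailing eol.
text = "aaa\naba\naba"

# Hypothetical naive pattern: case 1 (line + eol) OR case 2 (eol + last line).
naive = r".*b.*\n|\n.*b.*$"
naive_result = re.sub(naive, "", text)

# Grouped variant: consecutive matching lines are consumed together,
# with the last line as an optional member of the group.
grouped = r"(?:.*b.*\n)+(?:.*b.*$)?|\n.*b.*$|^.*b.*$"
grouped_result = re.sub(grouped, "", text)

print(repr(naive_result))    # 'aaa\naba'  (the last line survives)
print(repr(grouped_result))  # 'aaa\n'     (both "b" lines are gone)
```

With the naive pattern, the first "aba" takes the eol with it, so the final "aba" has no preceding eol left to match. On this particular input the grouped pattern keeps the eol that followed "aaa", so it is worth checking whether that fits how strictly trailing line breaks must mirror the input.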

In order to manage repetition of the line pattern .*b.*, you will need to assemble your search pattern from two parts: the line pattern and the list pattern that uses it multiple times. Since we're already deep in regular expressions, why not use re.sub() to do that as well?

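import re

# Pattern for a single line containing a "b"
LinePattern = "(.*b.*)"
# "Line" is a placeholder that gets replaced with LinePattern below
ListPattern = "(Line\n)+(Line$)?|(\nLine$)|(^Line$)" # Case1|Case2|Case3
Pattern     = re.sub("Line",LinePattern,ListPattern)

String  = "aba\naaa\naba\naaa\naba"
cleaned = re.sub(Pattern,"",String)
print(cleaned)  # only the two "aaa" lines remain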

Note: This technique would also work with a different separation character (e.g. comma instead of eol) but the character needs to be excluded from the line pattern (e.g. ([^,]*b[^,]*) )
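As a rough sketch of that note, here is the same assembly with a comma as the separator. The variable names and the sample string are assumptions made for this example; the line pattern is the one suggested in the note.

```python
import re

# Line pattern that excludes the separator (comma) from the "line" itself.
line_pattern = "([^,]*b[^,]*)"
# Same three cases as before, with "," in place of the eol character.
list_pattern = "(Line,)+(Line$)?|(,Line$)|(^Line$)"
pattern = re.sub("Line", line_pattern, list_pattern)

text = "aba,aaa,aba,aaa,aba"
cleaned = re.sub(pattern, "", text)
print(cleaned)  # aaa,aaa
```

Using re.sub() for the assembly works here because the line pattern contains no backslashes; a line pattern with backslashes would be interpreted as a replacement template, in which case str.replace() is the safer choice.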

Upvotes: 1

Wiktor Stribiżew

Reputation: 626699

You may use a regex or a non-regex approach:

import re
s = "aba\naaa\naba\naaa\naba"
print( "\n".join([st for st in s.splitlines() if 'b' not in st]) )
print( re.sub(r'^[^b\r\n]*b.*[\r\n]*', '', s, flags=re.M).strip() )

See the Python demo.

The non-regex approach, "\n".join([st for st in s.splitlines() if 'b' not in st]), splits the string at line breaks, filters out all lines containing b, and then joins the remaining lines back together.
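The split, filter, and join steps can be traced one at a time. This is only a sketch; the intermediate variable names and the second sample string are assumptions added for illustration.

```python
s = "aba\naaa\naba\naaa\naba"

# Step 1: split into lines (line break characters are dropped).
lines = s.splitlines()

# Step 2: keep only the lines that do not contain "b".
kept = [st for st in lines if "b" not in st]

# Step 3: join the remaining lines with a single "\n".
result = "\n".join(kept)
print(repr(result))  # 'aaa\naaa'

# Trailing line breaks are not fully preserved by this approach:
with_trailing = "\n".join(
    st for st in "aba\naaa\n\n".splitlines() if "b" not in st
)
print(repr(with_trailing))  # 'aaa\n' (one of the two trailing breaks is lost)
```

Because splitlines() discards the line terminators, the join cannot tell whether the input ended with a line break, which matters if trailing line breaks must be kept as in the question.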

The regex approach uses a pattern like r'^[^b\r\n]*b.*[\r\n]*':

  • ^ - start of a line
  • [^b\r\n]* - any 0 or more chars other than CR, LF and b
  • b - a b char
  • .* - any 0+ chars other than line break chars
  • [\r\n]* - 0+ CR or LF chars.

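Note that you need to call .strip() to get rid of the unwanted whitespace left at the start/end of the string after this. Be aware that .strip() also removes any leading or trailing whitespace that was already in the original string, including trailing line breaks you may want to keep.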
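To see what .strip() is cleaning up, the substitution can be inspected before and after stripping. This is a small sketch; the second input with two trailing line breaks is an assumption added to show the side effect.

```python
import re

pattern = r'^[^b\r\n]*b.*[\r\n]*'

s = "aba\naaa\naba\naaa\naba"
raw = re.sub(pattern, '', s, flags=re.M)
print(repr(raw))          # 'aaa\naaa\n'  (a line break is left at the end)
print(repr(raw.strip()))  # 'aaa\naaa'

# Side effect: trailing line breaks from the input are stripped as well.
s2 = "aba\naaa\n\n"
raw2 = re.sub(pattern, '', s2, flags=re.M)
print(repr(raw2))          # 'aaa\n\n'
print(repr(raw2.strip()))  # 'aaa'
```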

A single-regex solution is too cumbersome; I would not advise using it in real life:

rx = r'(?:{0}(?:\n|$))+|(?:\n|^){0}'.format(r'[^b\n]*b.*')
print( re.sub(rx, '', s) )

See Python demo.

The pattern will look like (?:[^b\n]*b.*(?:\n|$))+|(?:\n|^)[^b\n]*b.* and it will match

  • (?:[^b\n]*b.*(?:\n|$))+ - 1 or more repetitions of
    • [^b\n]* - any 0+ chars other than b and a newline
    • b.* - b and the rest of the line (.* matches any 0+ chars other than a newline)
    • (?:\n|$) - a newline or end of string
  • | - or
    • (?:\n|^) - a newline or start of string
    • [^b\n]*b.* - a line with at least one b on it
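The behaviour of the combined pattern can be checked against a few inputs. The inputs other than the question's string are assumptions chosen to cover a trailing-line-break case, consecutive matching lines at the end, and a single line.

```python
import re

rx = r'(?:{0}(?:\n|$))+|(?:\n|^){0}'.format(r'[^b\n]*b.*')

# Each input is paired with the result of removing its "b" lines.
cases = [
    "aba\naaa\naba\naaa\naba",  # the string from the question
    "aba\naaa\n\n",             # extra line breaks at the end
    "aaa\naba\naba",            # two matching lines at the end, no eol
    "aba",                      # a single matching line
]
results = [re.sub(rx, '', text) for text in cases]

for text, out in zip(cases, results):
    print(repr(text), '->', repr(out))
```

On these inputs the results are 'aaa\naaa', 'aaa\n\n', 'aaa', and '' respectively, so trailing line breaks are left as they were and no .strip() call is needed.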

Upvotes: 3
