Jay
Jay

Reputation: 33

RegEx to remove carriage returns between <p> tags

I've stumped myself trying to figure out how to remove carriage returns that occur between <p> tags. (Technically I need to replace them with spaces, not remove them.)

Here's an example. I've used a dollar sign $ as a carriage return marker.

<p>Ac nec <strong>suspendisse est, dapibus.</strong> Nulla taciti curabitur enim hendrerit.$
Ante ornare phasellus tellus vivamus dictumst dolor aliquam imperdiet lectus.$
Nisl nullam sodales, tincidunt dictum dui eget, gravida anno. Montes convallis$
adipiscing, aenean hac litora. Ridiculus, ut consequat curae, amet. Nostra$
phasellus ridiculus class interdum justo. <em>Pharetra urna est hac</em> laoreet, magna.$
Porttitor purus purus, quis rutrum turpis. Montes netus nibh ornare potenti quam$
class. Natoque nec proin sapien augue curae, elementum.</p>

As the example shows, there can be other tags inbetween the <p> tags. So I'm looking for a regex to replace all these carriage returns with spaces but not touch any carriage returns outside the <p> tags.

Any help is greatly appreciated. Thanks!

Upvotes: 1

Views: 5568

Answers (7)

Alan Moore
Alan Moore

Reputation: 75222

[\r\n]+(?=(?:[^<]+|<(?!/?p\b))*</p>)

The first part matches one or more of any kind of line separator (\n, \r\n, or \r). The rest is a lookahead that attempts to match everything up to the next closing </p> tag, but if it finds an opening <p> tag first, the match fails.

Note that this regex can be fooled very easily, for example by SGML comments, <script> elements, or plain old malformed HTML. Also, I'm assuming your regex flavor supports positive and negative lookaheads. That's a pretty safe assumption these days, but if the regex doesn't work for you, we'll need to know exactly which language or tool you're using.

Upvotes: 1

hobbs
hobbs

Reputation: 239841

This is the "almost good enough" lexing solution promised in my other answer, to sketch how it can be done. It makes a half-hearted attempt at coping with attributes, but not seriously. It also doesn't attempt to cope with unencoded "<" in attributes. These are relatively minor failings, and it does handle nested P tags, but as described in the comments it's totally unable to handle the case where someone doesn't close a P, because we can't do that without a thorough understanding of HTML. Considering how prevalent that practice still is, it's safe to declare this code "nearly useless". :)

#!/usr/bin/perl
use strict;
use warnings;

while ($html !~ /\G\Z/cg) {
  if ($html =~ /\G(<p[^>]*>)/cg) {
    $output .= $1;
    $in_p ++;
  } elsif ($html =~ m[\G(</p>)]cg) {
    $output .= $1;
    $in_p --; # Woe unto anyone who doesn't provide a closing tag.
    # Tag soup parsers are good for this because they can generate an
    # "artificial" end to the P when they find an element that can't contain
    # a P, or the end of the enclosing element. We're not smart enough for that.
  } elsif ($html =~ /\G([^<]+)/cg) {
    my $text = $1;
    $text =~ s/\s*\n\s*/ /g if $in_p;
    $output .= $text;
  } elsif ($html =~ /\G(<)/cg) {
    $output .= $1;
  } else {
    die "Can't happen, but not having an else is scary!";
  }
}

Upvotes: 0

hobbs
hobbs

Reputation: 239841

A single-regex solution is basically impossible here. If you absolutely insist on not using an HTML parser, and you can count on your input being well-formed and predictable then you can write a simple lexer that will do the job (and I can provide sample code) -- but it's still not a very good idea :)

For reference:

Upvotes: 4

Michał Niklas
Michał Niklas

Reputation: 54302

I think it should work like this:

  1. get whole paragraph (text between <p> and </p>) from teh body
  2. create copy of this paragraph
  3. in copy replace \n with space
  4. in the body repace paragraph with modified copy

You can do it using regex, but I think simple character scanning can be used.

Some code in Python:

rx = re.compile(r'(<p>.*?</p>)', re.IGNORECASE | re.MULTILINE | re.DOTALL)

def get_paragraphs(body):
    paragraphs = []
    body_copy = body
    rxx = rx.search(body_copy)
    while rxx:
        paragraphs.append(rxx.group(1))
        body_copy = body_copy[rxx.end(1):]
        rxx = rx.search(body_copy)
    return paragraphs

def replace_paragraphs(body):
    paragraphs = get_paragraphs(body)
    for par in paragraphs:
        par_new = par.replace('\n', ' ')
        body = body.replace(par, par_new)
    return body

def main():
    new_body = replace_paragraphs(BODY)
    print(new_body)

main() 

Upvotes: 0

NawaMan
NawaMan

Reputation: 25687

Just use '\n' but ensure that you enable multiple line regex.

Upvotes: 0

Alex Martelli
Alex Martelli

Reputation: 881575

Regular expressions are singularly unsuitable to deal with "balanced parentheses" kinds of problems, even though people persist in trying to shoehorn them there (and some implementations -- I'm thinking of very recent perl releases, for example -- try to cooperate with this widespread misconception by extending and stretching "regular expressions" well beyond the CS definition thereof;-).

If you don't have to deal with nesting, it's comfortably doable in a two-pass approach -- grab each paragraph with e.g. <p>.*?</p> (possibly with parentheses for grouping), then perform the substitution within each paragraph thus identified.

Upvotes: 2

Laurence Gonsalves
Laurence Gonsalves

Reputation: 143144

The standard answer is: don't try to process HTML (or SGML or XML) with a regex. Use a proper parser.

Upvotes: 3

Related Questions