Reputation: 65

Regex with multiple lines and HTML tags

I'm writing a script that should open 10 text files in turn (they are source codes from different webpages). I then want the script to go through and replace any instances of <br /> with \n. I then want it to delete the whole header, essentially. In any case, the document always starts with DOCTYPE and the last line before the information that I want ends

"decoration:underline">no year</span><br />

As far as I'm aware, the regex /.../s means 'ignore line breaks', and I've escaped the HTML / that appears in the </span> tag. So far, I have the following

import re
def create_linebreaks(l):
    l = l.replace('<br />', r'\n')
    return l
def clean_up(line):
    line = re.sub(r'/^<!DOCTYPE.+no year<\/span>/s', '', line)
    return line

data = """<!DOCTYPE html><html class='v2' dir='ltr' xmlns='http://www.w3.org/1999/xhtml' xmlns:b='http://www.google.com/2005/gml/b' movie/file/show/episodes is 2763.</p>A LOAD OF OTHER HTML I DON'T WANT TO BE IN THE OUTPUT
<!-- google_ad_section_start(weight=ignore) --><span class="listings"><span style="font-size:large;font-weight:bold; text-decoration:underline">no year</span><br />  <b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.5 stars, 1hr 24m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " />  <br />  <br />  <b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.7 stars, 1hr 28m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " />  <br />  <br />"""

create_linebreaks(data)
clean_up(data)
print data
raw_input()

All I get out, however is the same string.

Desired output is something like:

"""  <b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.5 stars, 1hr 24m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " />  

<b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.7 stars, 1hr 28m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " />  """

Upvotes: 1

Answers (1)

mhawke

Reputation: 87084

The main problem is that your regex pattern is wrong for Python.

In r'/^<!DOCTYPE.+no year<\/span>/s', the leading / and trailing /s are considered to be part of the pattern, not modifiers of its behaviour. This looks like PCRE regex syntax a la PHP, and it is not supported in Python. Instead, to get . to match any character including newline, you need to set the re.DOTALL flag as shown below.

The other problem is that the return value from create_linebreaks() and clean_up() is not assigned back to data, so the changes are lost.

Also, you don't want a raw string for the newline character in create_linebreaks(), a normal string is fine (otherwise you would replace <br /> with \\n).

import re

def create_linebreaks(l):
    l = l.replace('<br />', '\n')
    return l

def clean_up(line):
    line = re.sub(r'^<!DOCTYPE.+no year<\/span>', '', line, flags=re.DOTALL)
    return line

data = """<!DOCTYPE html><html class='v2' dir='ltr' xmlns='http://www.w3.org/1999/xhtml' xmlns:b='http://www.google.com/2005/gml/b' movie/file/show/episodes is 2763.</p>A LOAD OF OTHER HTML I DON'T WANT TO BE IN THE OUTPUT
<!-- google_ad_section_start(weight=ignore) --><span class="listings"><span style="font-size:large;font-weight:bold; text-decoration:underline">no year</span><br />  <b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.5 stars, 1hr 24m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " />  <br />  <br />  <b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.7 stars, 1hr 28m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " />  <br />  <br />"""

data = create_linebreaks(data)
data = clean_up(data)

>>> print data

  <b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.5 stars, 1hr 24m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " />  

  <b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.7 stars, 1hr 28m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " />  


>>>

Upvotes: 1

Regex with multiple lines and HTML tags

Answers (1)

Related Questions