Reputation: 65
I'm writing a script that should open 10 text files in turn (they are the page sources of different webpages). I then want the script to go through each one and replace every instance of <br /> with \n. I then want it to delete the whole header, essentially. In any case, the document always starts with DOCTYPE, and the last line before the information that I want ends with
"decoration:underline">no year</span><br />
As far as I'm aware, the regex /.../s means 'ignore line breaks', and I've escaped the / that appears in the </span> tag.
So far, I have the following:
import re
def create_linebreaks(l):
    l = l.replace('<br />', r'\n')
    return l
def clean_up(line):
    line = re.sub(r'/^<!DOCTYPE.+no year<\/span>/s', '', line)
    return line
data = """<!DOCTYPE html><html class='v2' dir='ltr' xmlns='http://www.w3.org/1999/xhtml' xmlns:b='http://www.google.com/2005/gml/b' movie/file/show/episodes is 2763.</p>A LOAD OF OTHER HTML I DON'T WANT TO BE IN THE OUTPUT
<!-- google_ad_section_start(weight=ignore) --><span class="listings"><span style="font-size:large;font-weight:bold; text-decoration:underline">no year</span><br /> <b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b> <i style="font-size:small"> 3.5 stars, 1hr 24m <a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i> <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål - English " alt="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål - English " /> <br /> <br /> <b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b> <i style="font-size:small"> 3.7 stars, 1hr 28m <a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i> <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål " alt="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål " /> <br /> <br />"""
create_linebreaks(data)
clean_up(data)
print data
raw_input()
All I get out, however, is the same string.
Desired output is something like:
""" <b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b> <i style="font-size:small"> 3.5 stars, 1hr 24m <a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i> <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål - English " alt="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål - English " />
<b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b> <i style="font-size:small"> 3.7 stars, 1hr 28m <a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i> <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål " alt="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål " /> """
Upvotes: 1
Views: 1043
Reputation: 87084
The main problem is that your regex pattern is wrong for Python. In r'/^<!DOCTYPE.+no year<\/span>/s', the leading / and the trailing /s are treated as part of the pattern, not as delimiters or modifiers of its behaviour. This looks like the delimiter-and-modifier syntax used with PCRE in PHP (and in Perl), which Python does not support. Instead, to get . to match any character including a newline, you need to set the re.DOTALL flag as shown below.
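To make the difference concrete, here is a minimal sketch (written for Python 3, with a made-up three-line `text` standing in for the real page) that compares the Perl-style pattern, the same pattern without the flag, and the pattern with re.DOTALL:

```python
import re

# Hypothetical stand-in for the real page source.
text = "<!DOCTYPE html>\nheader\nno year</span>\nbody"

# The "/" and "/s" are literal characters here, so nothing in text matches.
perl_style = re.sub(r'/^<!DOCTYPE.+no year<\/span>/s', '', text)

# Without DOTALL, "." stops at the first newline, so the match cannot span lines.
no_flag = re.sub(r'^<!DOCTYPE.+no year</span>', '', text)

# With DOTALL, "." also matches "\n" and the whole header is removed.
with_flag = re.sub(r'^<!DOCTYPE.+no year</span>', '', text, flags=re.DOTALL)

print(perl_style == text)  # True: no substitution happened
print(no_flag == text)     # True: no substitution happened
print(repr(with_flag))     # '\nbody'
```

Escaping the / as \/ is harmless in Python but not needed, since / has no special meaning in the re module.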
The other problem is that the return values of create_linebreaks() and clean_up() are not assigned back to data, so the changes are lost. Python strings are immutable, so neither function can modify data in place.
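A small illustration of why the assignment matters, using a shortened, made-up input string (the names `original` and `converted` are only for this sketch):

```python
def create_linebreaks(text):
    # str.replace builds and returns a new string; `text` itself is untouched.
    return text.replace('<br />', '\n')

original = 'first<br />second'

create_linebreaks(original)              # return value discarded, nothing changes
print(repr(original))                    # 'first<br />second'

converted = create_linebreaks(original)  # keep the result by binding it to a name
print(repr(converted))                   # 'first\nsecond'
```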
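Also, you don't want a raw string for the newline character in create_linebreaks(); a normal string is what you need. With r'\n' you would replace <br /> with the two literal characters \ and n (which Python displays as '\\n') rather than with an actual newline.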
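The difference between the two literals can be checked directly. This is just a quick sketch with a toy input:

```python
raw = r'\n'     # two characters: a backslash followed by the letter n
normal = '\n'   # one character: a line feed

print(len(raw), len(normal))                       # 2 1

# With the raw string, the backslash and the n end up in the text verbatim.
print(repr('a<br />b'.replace('<br />', raw)))     # 'a\\nb'

# With the normal string, a real newline is inserted.
print(repr('a<br />b'.replace('<br />', normal)))  # 'a\nb'
```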
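import re
def create_linebreaks(l):
    # A normal (non-raw) string, so '\n' is a real newline character
    l = l.replace('<br />', '\n')
    return l
def clean_up(line):
    # re.DOTALL lets . match newlines, so the match can span the whole header
    line = re.sub(r'^<!DOCTYPE.+no year<\/span>', '', line, flags=re.DOTALL)
    return line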
data = """<!DOCTYPE html><html class='v2' dir='ltr' xmlns='http://www.w3.org/1999/xhtml' xmlns:b='http://www.google.com/2005/gml/b' movie/file/show/episodes is 2763.</p>A LOAD OF OTHER HTML I DON'T WANT TO BE IN THE OUTPUT
<!-- google_ad_section_start(weight=ignore) --><span class="listings"><span style="font-size:large;font-weight:bold; text-decoration:underline">no year</span><br /> <b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b> <i style="font-size:small"> 3.5 stars, 1hr 24m <a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i> <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål - English " alt="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål - English " /> <br /> <br /> <b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b> <i style="font-size:small"> 3.7 stars, 1hr 28m <a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i> <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål " alt="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål " /> <br /> <br />"""
data = create_linebreaks(data)
data = clean_up(data)
>>> print data
<b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b> <i style="font-size:small"> 3.5 stars, 1hr 24m <a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i> <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål - English " alt="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål - English " />
<b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b> <i style="font-size:small"> 3.7 stars, 1hr 28m <a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i> <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål " alt="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål " />
>>>
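Since the question mentions running this over 10 files, here is one possible way to wrap the same two steps for several files. It is only a sketch under a few assumptions: it targets Python 3, the files are UTF-8, and the names `HEADER`, `clean_page` and `clean_files` are made up for illustration. It also swaps in a non-greedy .+? so the match stops at the first "no year</span>" marker rather than the last one, which the code above does not do.

```python
import re

# Compiled once and reused. \A anchors at the very start of the text and
# ".+?" (non-greedy) stops at the first occurrence of the marker.
HEADER = re.compile(r'\A<!DOCTYPE.+?no year</span>', flags=re.DOTALL)

def clean_page(html):
    """Drop the header up to the marker, then turn <br /> tags into newlines."""
    body = HEADER.sub('', html, count=1)
    return body.replace('<br />', '\n')

def clean_files(paths):
    """Return a dict mapping each path to its cleaned text (assumes UTF-8 files)."""
    cleaned = {}
    for path in paths:
        with open(path, encoding='utf-8') as handle:
            cleaned[path] = clean_page(handle.read())
    return cleaned

sample = '<!DOCTYPE html>junk\nno year</span><br /> <b>Title</b>'
print(repr(clean_page(sample)))  # '\n <b>Title</b>'
```

For anything beyond a fixed, known page layout, an HTML parser is usually a more robust choice than a regex, but for a single stable marker like this one the regex approach is workable.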
Upvotes: 1