Reputation: 443
I have this text:
<div style="margin-left:10px;margin-right:10px;">
<!-- start of lyrics -->
There are times when I've wondered<br />
And times when I've cried<br />
When my prayers they were answered<br />
At times when I've lied<br />
But if you asked me a question<br />
Would I tell you the truth<br />
Now there's something to bet on<br />
You've got nothing to lose<br />
<br />
When I've sat by the window<br />
And gazed at the rain<br />
With an ache in my heart<br />
But never feeling the pain<br />
And if you would tell me<br />
Just what my life means<br />
Walking a long road<br />
Never reaching the end<br />
<br />
God give me the answer to my life<br />
God give me the answer to my dreams<br />
God give me the answer to my prayers<br />
God give me the answer to my being
<!-- end of lyrics -->
</div>
I want to print the lyrics of this song, but re.findall
and re.search don't work in this case. How can I do it? I'm using this code:
lyrics = re.findall('<div style="margin-left:10px;margin-right:10px;">(.*?)</div>', open('file.html','r').read())
for words in lyrics:
    print words
Upvotes: 1
Views: 139
Reputation: 920
Try this:
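import re

with open(r'<file_path>', 'r') as file:
    for line in file:
        # Skip lines that start with a tag (the div, the comments, bare <br /> lines)
        if re.match(r'^<', line) is None:
            # Keep only the text in front of the first tag on the line
            print line[:line.find(r'<')]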
OUTPUT
There are times when I've wondered
And times when I've cried
When my prayers they were answered
At times when I've lied
But if you asked me a question
Would I tell you the truth
Now there's something to bet on
You've got nothing to lose
When I've sat by the window
And gazed at the rain
With an ache in my heart
But never feeling the pain
And if you would tell me
Just what my life means
Walking a long road
Never reaching the end
God give me the answer to my life
God give me the answer to my dreams
God give me the answer to my prayers
God give me the answer to my being
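The same line-by-line idea can be sketched for Python 3. This is only a minimal sketch, not the code above: `extract_lines` and the shortened `sample` string are hypothetical names introduced for illustration, and it uses `str.partition` so that a line without any tag is kept whole instead of relying on `find` returning -1.

```python
def extract_lines(text):
    """Return the text before the first '<' on every line that does not start with a tag."""
    lines = []
    for line in text.splitlines():
        if line.startswith('<'):
            continue  # skip the div, the comments and bare <br /> lines
        before_tag, _, _ = line.partition('<')
        lines.append(before_tag)
    return lines


# Shortened sample with the same structure as the markup in the question
sample = """<div style="margin-left:10px;margin-right:10px;">
<!-- start of lyrics -->
There are times when I've wondered<br />
And times when I've cried<br />
<br />
God give me the answer to my being
<!-- end of lyrics -->
</div>"""

for line in extract_lines(sample):
    print(line)
```

Like the original, this drops the bare `<br />` lines, so the blank lines between stanzas are not preserved.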
EDIT: Using urllib and lxml to extract the lyrics directly from the web:
from lxml import etree
import urllib
import StringIO

# Fetch the page from the URL
resultado = urllib.urlopen('http://www.azlyrics.com/lyrics/ironmaiden/noprayerforthedying.html')
html = resultado.read()
# Parse the HTML into an element tree
parser = etree.HTMLParser()
tree = etree.parse(StringIO.StringIO(html), parser)
# Apply the XPath rule: text nodes directly inside the lyrics div
e = tree.xpath("//div[@style='margin-left:10px;margin-right:10px;']/text()")
# Print the output
for i in e:
    print str(i).strip()
Upvotes: 1
Reputation: 39
For this specific piece of HTML, I don't see why re.findall shouldn't work if you match one line at a time. Four lines of actual code, plus the text, produce the output:
from re import findall
html = """
<div style="margin-left:10px;margin-right:10px;">
<!-- start of lyrics -->
There are times when I've wondered<br />
And times when I've cried<br />
When my prayers they were answered<br />
At times when I've lied<br />
But if you asked me a question<br />
Would I tell you the truth<br />
Now there's something to bet on<br />
You've got nothing to lose<br />
<br />
When I've sat by the window<br />
And gazed at the rain<br />
With an ache in my heart<br />
But never feeling the pain<br />
And if you would tell me<br />
Just what my life means<br />
Walking a long road<br />
Never reaching the end<br />
<br />
God give me the answer to my life<br />
God give me the answer to my dreams<br />
God give me the answer to my prayers<br />
God give me the answer to my being
<!-- end of lyrics -->
</div>
"""
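# Capture the text in front of each <br />. (line.strip('<br />') would strip
# those characters one by one, not the tag as a whole.)
# Note: the last lyric line has no <br />, so this pattern does not return it.
raw = findall(r'(.*)<br />', html)
for line in raw:
    print(line)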
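As for why the pattern in the question finds nothing: by default `.` does not match a newline, so `(.*?)` cannot span a div whose content runs over several lines. A minimal sketch, assuming Python 3 and a shortened sample of the markup, shows the difference `re.DOTALL` makes. The names `no_flag`, `with_flag` and `body` are only illustrative, and stripping tags with a regex like this is fragile on real pages.

```python
import re

# Shortened sample of the question's markup (assumption: same structure as file.html)
html = """<div style="margin-left:10px;margin-right:10px;">
<!-- start of lyrics -->
There are times when I've wondered<br />
And times when I've cried<br />
<!-- end of lyrics -->
</div>"""

pattern = r'<div style="margin-left:10px;margin-right:10px;">(.*?)</div>'

# Without re.DOTALL, '.' stops at the first newline, so nothing matches
no_flag = re.findall(pattern, html)

# With re.DOTALL, '.' also matches '\n' and the whole div body is captured
with_flag = re.findall(pattern, html, re.DOTALL)

# Drop comments and tags, then the surrounding whitespace
body = re.sub(r'<!--.*?-->|<[^>]+>', '', with_flag[0], flags=re.DOTALL).strip()
print(body)
```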
Upvotes: 0
Reputation: 59604
You should not use regular expressions to parse HTML.
It looks like you are scraping a website. You could use scrapy, and lxml with XPath inside it.
Python 2.7.5+ (default, Sep 19 2013, 13:48:49)
>>> html = """<div style="margin-left:10px;margin-right:10px;">
... <!-- start of lyrics -->
... There are times when I've wondered<br />
... And times when I've cried<br />
... When my prayers they were answered<br />
... At times when I've lied<br />
... But if you asked me a question<br />
... Would I tell you the truth<br />
... Now there's something to bet on<br />
... You've got nothing to lose<br />
... <br />
... When I've sat by the window<br />
... And gazed at the rain<br />
... With an ache in my heart<br />
... But never feeling the pain<br />
... And if you would tell me<br />
... Just what my life means<br />
... Walking a long road<br />
... Never reaching the end<br />
... <br />
... God give me the answer to my life<br />
... God give me the answer to my dreams<br />
... God give me the answer to my prayers<br />
... God give me the answer to my being
... <!-- end of lyrics -->
... </div>"""
>>> import lxml.html
>>> html = lxml.html.fromstring(html)
>>> html.text_content()
"\n\nThere are times when I've wondered\nAnd times when I've cried\nWhen my prayers they were answered\nAt times when I've lied\nBut if you asked me a question\nWould I tell you the truth\nNow there's something to bet on\nYou've got nothing to lose\n\nWhen I've sat by the window\nAnd gazed at the rain\nWith an ache in my heart\nBut never feeling the pain\nAnd if you would tell me\nJust what my life means\nWalking a long road\nNever reaching the end\n\nGod give me the answer to my life\nGod give me the answer to my dreams\nGod give me the answer to my prayers\nGod give me the answer to my being\n\n"
>>>
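If installing lxml is not an option, the same parser-based idea can be sketched with the standard library's `html.parser` module. This is a minimal sketch assuming Python 3, not a replacement for the lxml session above: `DivTextCollector` and the shortened `sample` are hypothetical names, and the class simply collects any text that appears while inside a `<div>`.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collect the text that appears inside <div> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <div> elements currently open
        self.chunks = []  # text fragments seen while inside a div

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Comments are routed to handle_comment, so they never reach this method
        if self.depth > 0:
            self.chunks.append(data)


# Shortened sample with the same structure as the markup in the question
sample = """<div style="margin-left:10px;margin-right:10px;">
<!-- start of lyrics -->
There are times when I've wondered<br />
And times when I've cried<br />
<br />
God give me the answer to my being
<!-- end of lyrics -->
</div>"""

collector = DivTextCollector()
collector.feed(sample)
collector.close()

# Join the fragments, then keep only the non-empty lines
lyrics = [line.strip() for line in ''.join(collector.chunks).splitlines() if line.strip()]
print('\n'.join(lyrics))
```

Filtering out empty lines also removes the blank lines between stanzas; drop that filter if the stanza breaks matter.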
Upvotes: 1