Reputation: 175
I'm trying to extract particular strings from markup and save them (for more complex processing on this line). For example, say I've read in a line from a file and the current line is:
<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg" WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">
But I want to store:
tempUrl = 'http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg'
tempWidth = 500
tempHeight = 375
tempAlt = 'Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road'
How would I go about doing that in Python?
Thanks
Upvotes: 0
Views: 56
Reputation: 31
And the regex approach:
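import re
string = "YOUR STRING"
# Lazy (.*?) groups capture each attribute value; the attributes must appear in this order.
pattern = r'src="(.*?)".*?WIDTH="(.*?)".*?HEIGHT="(.*?)".*?alt="(.*?)"'
match = re.search(pattern, string)
if match:  # re.search returns None when the line has no matching tag
    tempUrl, tempWidth, tempHeight, tempAlt = match.groups()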
All values are strings, though, so cast them if you want.
Also be aware that copy-pasting a regex is a bad idea; mistakes creep in easily.
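To make the casting and the fragility point concrete, here is a rough sketch of a variant that does not depend on attribute order or case. It is only an illustration, not a robust HTML parser: the helper name `img_attrs` and the variable `line` are made up for this example, and it assumes every attribute value is wrapped in double quotes, as in the question's line.

```python
import re

line = '<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg" WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">'


def img_attrs(text):
    """Return the attributes of the first <img> tag as a dict with lowercase keys.

    Hypothetical helper for illustration; assumes double-quoted values.
    """
    tag = re.search(r'<img\b[^>]*>', text, re.IGNORECASE)
    if tag is None:
        return {}
    # Collect every name="value" pair, whatever order they appear in.
    pairs = re.findall(r'([\w-]+)\s*=\s*"([^"]*)"', tag.group(0))
    return {name.lower(): value for name, value in pairs}


attrs = img_attrs(line)
tempUrl = attrs.get('src')
tempWidth = int(attrs['width'])    # regex captures are strings, so cast explicitly
tempHeight = int(attrs['height'])
tempAlt = attrs.get('alt')
```

Lowercasing the keys means `WIDTH="500"` and `width="500"` are treated the same, and a line without an `<img>` tag yields an empty dict rather than an exception from the search itself.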
Upvotes: 0
Reputation: 8254
Though you can get away with several approaches here, I recommend using an HTML parser, which is extensible and can deal with many issues in the HTML. Here's a working example with BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>> string = """<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg" WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">"""
>>> soup = BeautifulSoup(string, 'html.parser')
>>> for attr in ['width', 'height', 'alt']:
...     print('temp{} = {}'.format(attr.title(), soup.img[attr]))
...
tempWidth = 500
tempHeight = 375
tempAlt = Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road
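If installing BeautifulSoup isn't an option, a similar result can probably be had with the standard library's `html.parser` module, and the values can be stored in variables (including `src`) rather than printed. This is a minimal sketch rather than a drop-in replacement for the answer above; the class name `ImgAttrParser` and its `img_attrs` attribute are invented for illustration.

```python
from html.parser import HTMLParser


class ImgAttrParser(HTMLParser):
    """Remember the attributes of the first <img> tag seen (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.img_attrs = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names already lowercased.
        if tag == 'img' and self.img_attrs is None:
            self.img_attrs = dict(attrs)


string = """<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg" WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">"""

parser = ImgAttrParser()
parser.feed(string)
parser.close()

tempUrl = parser.img_attrs['src']
tempWidth = int(parser.img_attrs['width'])    # attribute values are strings
tempHeight = int(parser.img_attrs['height'])
tempAlt = parser.img_attrs['alt']
```

If the line contains no `<img>` tag, `parser.img_attrs` stays `None`, so real code would want to check for that before indexing.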
Upvotes: 3