Reputation: 247
I have to extract the values of certain attributes from an HTML page source. For example, how do I get the value of the address from the following?
<span class="address">413 W. Street</span></span><br>
EDIT: Sorry, I got the question wrong. I tried to delete this question but wasn't able to. I have posted a new question here: https://stackoverflow.com/questions/9144544/regular-expressions-for-different-attributes
Upvotes: 0
Views: 101
Reputation: 17234
Here is an example in Python:
>>> import re
>>> s = '<span class="address">413 W. Street</span><br><span class="phone">218-999-1020</span>, <span class="region">WA</span> <span class="postal-code">87112</span><br>'
>>> re.findall(r'address">(.*?)<.*phone">(.*?)<.*region">(.*?)<.*postal-code">(.*?)<', s)
[('413 W. Street', '218-999-1020', 'WA', '87112')]
Upvotes: 0
Reputation: 27132
You should not use regular expressions to parse HTML. This is explained well here:
RegEx match open tags except XHTML self-contained tags
Still, if you know the exact structure of the HTML you want to parse, you can try this regex (written for a C# program, so it may vary depending on your language):
\<span[^">]*class="([^"]+)[^>]*>([^<]*)
You can then read the class name (e.g. address, phone) from the first captured group and the value from the second.
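As a rough sketch of how the two groups could be used, here is the same pattern applied in Python rather than C#. The sample string is borrowed from the other answer in this thread, and the names `SPAN_RE` and `fields` are only illustrative. The leading backslash before `<` is dropped because Python does not need it.

```python
import re

# Same pattern as above: group 1 is the class name, group 2 is the text.
SPAN_RE = re.compile(r'<span[^">]*class="([^"]+)[^>]*>([^<]*)')

html = ('<span class="address">413 W. Street</span><br>'
        '<span class="phone">218-999-1020</span>, '
        '<span class="region">WA</span> '
        '<span class="postal-code">87112</span><br>')

# findall returns one (class, text) tuple per match, so dict() gives
# a simple class-name -> value lookup.
fields = dict(SPAN_RE.findall(html))
print(fields["address"])
```

This only holds up while every span has the simple `class="..."` shape shown; nested tags or repeated class names would need a different approach.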
Upvotes: 0
Reputation: 3847
It's difficult to use regex to scrape data from raw HTML, since the pattern may differ from site to site. It's easier to use something that can walk the DOM tree.
If you're using Python, you can use BeautifulSoup, which does exactly what you want; see its documentation.
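To illustrate the parser-based idea without installing anything, here is a minimal sketch that uses the standard library's `html.parser` as a stand-in for BeautifulSoup. It is an event-based parser, not a DOM tree, and the `SpanCollector` class is a hypothetical helper written for this example. It assumes the spans are not nested.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Collect the text of every <span> that carries a class attribute."""

    def __init__(self):
        super().__init__()
        self.fields = {}      # class name -> text content
        self._current = None  # class of the span being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._current = dict(attrs).get("class")

    def handle_data(self, data):
        # Only keep text that appears inside a classed span.
        if self._current is not None:
            self.fields[self._current] = self.fields.get(self._current, "") + data

    def handle_endtag(self, tag):
        if tag == "span":
            self._current = None


html = ('<span class="address">413 W. Street</span><br>'
        '<span class="phone">218-999-1020</span>, '
        '<span class="region">WA</span> '
        '<span class="postal-code">87112</span><br>')

parser = SpanCollector()
parser.feed(html)
parser.close()
print(parser.fields["address"])
```

With BeautifulSoup itself, the lookup would typically be shorter, along the lines of `soup.find("span", class_="address").get_text()`.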
Upvotes: 1