user807496

Reputation: 247

Regular expression extracting data

I need to extract the values of certain attributes from an HTML page source.

For example, how would I get the value of address from the following?

    <span class="address">413 W. Street</span></span><br>

EDIT: Sorry, I misunderstood the question. I tried deleting this question but wasn't able to. I have posted a new question here: https://stackoverflow.com/questions/9144544/regular-expressions-for-different-attributes

Upvotes: 0

Views: 101

Answers (3)

shadyabhi

Reputation: 17234

Here is a Python example:

>>> import re
>>> s = '<span class="address">413 W. Street</span><br><span class="phone">218-999-1020</span>, <span class="region">WA</span> <span class="postal-code">87112</span><br>'
>>> re.findall(r'address">(.*?)<.*phone">(.*?)<.*region">(.*?)<.*postal-code">(.*?)<', s)
[('413 W. Street', '218-999-1020', 'WA', '87112')]
>>> 

BTW, don't forget to see this

Upvotes: 0

Marek Musielak

Reputation: 27132

You should not use regular expressions to parse HTML. It's well explained here:

RegEx match open tags except XHTML self-contained tags

Still, if you know the exact structure of the HTML you want to parse, you can try this regex (written for a C# program, so it may need adjusting for your language):

\<span[^">]*class="([^"]+)[^>]*>([^<]*)

You can then read the class name (e.g. address, phone) from the first capture group and the value from the second.
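
As a rough illustration, here is how that pattern might be tried in Python rather than C#. This is a translation, not the answer author's code. The sample string is borrowed from the first answer, the names `pattern`, `sample` and `fields` are made up for the sketch, and the backslash before `<` is dropped because `<` needs no escaping in Python's `re`.

```python
import re

# Sample markup borrowed from the first answer (an assumption about the page).
sample = ('<span class="address">413 W. Street</span><br>'
          '<span class="phone">218-999-1020</span>, '
          '<span class="region">WA</span> '
          '<span class="postal-code">87112</span><br>')

# Group 1: value of the class attribute. Group 2: text up to the next tag.
pattern = re.compile(r'<span[^">]*class="([^"]+)[^>]*>([^<]*)')

# findall returns one (class, text) tuple per matching span.
fields = dict(pattern.findall(sample))

print(fields["address"])      # 413 W. Street
print(fields["postal-code"])  # 87112
```

This only holds while the markup stays this simple: nested tags inside a span or single-quoted attributes would break it.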

Upvotes: 0

sharkfin

Reputation: 3847

Scraping data from raw HTML with regex is fragile, since the markup pattern can differ from site to site. It's easier to use a tool that can walk the DOM tree.

If you're using Python, you can use BeautifulSoup (see its documentation). It does exactly what you want.
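
To give a feel for the parser-based approach without installing anything, here is a minimal sketch that uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup. It is not BeautifulSoup's API. The class name `SpanCollector` and the sample markup (borrowed from the first answer) are assumptions made for illustration.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Collect the text of each <span>, keyed by its class attribute."""

    def __init__(self):
        super().__init__()
        self.values = {}
        self._current = None  # class of the span being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            # attrs is a list of (name, value) pairs; spans without a
            # class attribute leave _current as None and are skipped.
            self._current = dict(attrs).get("class")

    def handle_data(self, data):
        if self._current is not None:
            self.values[self._current] = self.values.get(self._current, "") + data

    def handle_endtag(self, tag):
        if tag == "span":
            self._current = None


# Sample markup borrowed from the first answer (an assumption about the page).
html = ('<span class="address">413 W. Street</span><br>'
        '<span class="phone">218-999-1020</span>, '
        '<span class="region">WA</span> '
        '<span class="postal-code">87112</span><br>')

collector = SpanCollector()
collector.feed(html)
collector.close()

print(collector.values["address"])  # 413 W. Street
```

The sketch does not handle nested spans. A full DOM library such as BeautifulSoup takes care of that kind of bookkeeping for you.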

Upvotes: 1
