Extract HTML data using regular expressions

Question

I have an HTML page that looks like this

I need to get the text

history/2c0b65635b3ac68a4d53b89521216d26.html
marketing/3c0a65635b2bc68b5c43b88421306c37.html

I wrote a script in Python that uses regular expressions:

import re
a = re.compile("[0-9 a-z]{0,15}/[0-9 a-f]{32}.html")
print(a.match(s))

where s holds the HTML page above. However, when I run this script it prints "None". Where did I go wrong?

alecxe · Accepted Answer

Don't use regex for parsing HTML content.

Use a specialized tool - an HTML Parser.

Example (using BeautifulSoup):

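from bs4 import BeautifulSoup

data = """Your HTML here"""

# Name the parser explicitly; without it bs4 guesses one and emits a warning.
soup = BeautifulSoup(data, "html.parser")

# 'td a[href]' selects every <a> that has an href attribute and sits inside a <td>.
for link in soup.select('td a[href]'):
    print(link['href'])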

Prints:

history/2c0b65635b3ac68a4d53b89521216d26.html
history/2c0b65635b3ac68a4d53b89521216d26_0.html
marketing/3c0a65635b2bc68b5c43b88421306c37.html
marketing/3c0a65635b2bc68b5c43b88421306c37_0.html
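If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser module. This is a swapped-in alternative, not the approach shown above. The page itself is not shown in the question, so the table markup below is an assumption built from the four hrefs in the output, and HrefCollector is a hypothetical helper name.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values of <a> tags that appear inside a <td>."""

    def __init__(self):
        super().__init__()
        self.td_depth = 0   # how many <td> elements are currently open
        self.hrefs = []     # collected href values, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.td_depth += 1
        elif tag == "a" and self.td_depth > 0:
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "td" and self.td_depth > 0:
            self.td_depth -= 1


# Made-up markup standing in for the real page (an assumption).
data = """
<table>
  <tr>
    <td><a href="history/2c0b65635b3ac68a4d53b89521216d26.html">history</a></td>
    <td><a href="history/2c0b65635b3ac68a4d53b89521216d26_0.html">history (0)</a></td>
  </tr>
  <tr>
    <td><a href="marketing/3c0a65635b2bc68b5c43b88421306c37.html">marketing</a></td>
    <td><a href="marketing/3c0a65635b2bc68b5c43b88421306c37_0.html">marketing (0)</a></td>
  </tr>
</table>
"""

collector = HrefCollector()
collector.feed(data)
for href in collector.hrefs:
    print(href)
```

The td_depth counter is a rough stand-in for the 'td a[href]' selector; it is enough for simple, well-formed tables but does not try to reproduce full CSS selector semantics.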

Or, if you want to get the href values that follow a pattern, use:

import re

# Passing a compiled pattern as href keeps only the <a> tags whose href matches it.
for link in soup.find_all('a', href=re.compile(r'\w+/\w{32}\.html')):
    print(link['href'])

where r'\w+/\w{32}\.html' is a regular expression applied to the href attribute of every a tag found. It matches one or more word characters (\w+: letters, digits, or underscores), followed by a slash, followed by exactly 32 word characters (\w{32}), followed by a literal dot (\. since an unescaped dot would match any character), followed by html. Because the 32 characters must be followed immediately by the dot, the links ending in _0.html are not matched.
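As for why the script in the question printed None: re.match only attempts a match at the very start of the string, and a page of HTML starts with a tag, not with a path. re.search and re.findall scan the whole string. The sketch below shows the difference on a made-up fragment (the real page is not shown, so the markup is an assumption). It runs the pattern over raw HTML only to illustrate match versus search; using a parser is still the more robust choice.

```python
import re

# Hypothetical stand-in for the page; the real HTML is not shown in the question.
s = (
    '<table><tr>'
    '<td><a href="history/2c0b65635b3ac68a4d53b89521216d26.html">history</a></td>'
    '<td><a href="history/2c0b65635b3ac68a4d53b89521216d26_0.html">history 0</a></td>'
    '</tr></table>'
)

pattern = re.compile(r'\w+/\w{32}\.html')

# match() only tries at position 0, where the string begins with "<table>".
at_start = pattern.match(s)

# search() scans forward and returns the first occurrence anywhere in the string.
first = pattern.search(s)

# findall() returns every non-overlapping occurrence as a list of strings.
all_hits = pattern.findall(s)

print(at_start)
print(first.group(0))
print(all_hits)
```

Here match() finds nothing, while search() and findall() both find the first href. The _0.html link is skipped because the 32 word characters are followed by an underscore instead of a dot.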

