vitperov

Reputation: 1407

regular expressions: extract text between two markers

I'm trying to write a Python parser to extract some information from HTML pages.

It should extract the text between <p itemprop="xxx"> and </p>.

I use this regular expression:

m = re.search(ur'p>(?P<text>[^<]*)</p>', html)

but it fails when there are other tags between the two markers. For example:

<p itemprop="xxx"> some text <br/> another text </p>

As I understand it, [^<] excludes only a single character. How do I write "everything except </p>"?

Upvotes: 2

Views: 2403

Answers (2)

enrico.bacis

Reputation: 31524

You can use:

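m = re.search(ur'<p[^>]*>(?P<text>.*?)</p>', html, re.DOTALL)

This is a lazy match: .*? consumes as few characters as possible, so it stops at the first </p>. The opening marker is written as <p[^>]*> so that tags with attributes, such as <p itemprop="xxx">, also match, and re.DOTALL lets . match newlines in case the paragraph spans several lines. You should also consider using an HTML parser like BeautifulSoup which, after installation, can be used with CSS selectors like this: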

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
m = soup.select('p[itemprop="xxx"]')       # list of matching <p> tags
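If installing BeautifulSoup is not an option, roughly the same extraction can be sketched with the standard library's html.parser alone. This is written for Python 3. The class name ItempropText and the sample html string are made up for illustration, and the sketch assumes the target <p> elements are not nested and are always explicitly closed.

```python
from html.parser import HTMLParser


class ItempropText(HTMLParser):
    """Collect the text inside every <p> whose itemprop has a given value."""

    def __init__(self, itemprop):
        super().__init__()
        self.itemprop = itemprop
        self.inside = False   # True between the opening <p ...> and its </p>
        self.chunks = []      # text pieces of the paragraph being read
        self.results = []     # one string per matching paragraph

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'p' and dict(attrs).get('itemprop') == self.itemprop:
            self.inside = True
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == 'p' and self.inside:
            self.inside = False
            self.results.append(''.join(self.chunks))

    def handle_data(self, data):
        # Text only; nested tags such as <br/> never reach this method
        if self.inside:
            self.chunks.append(data)


html = '<p itemprop="xxx"> some text <br/> another text </p>'
parser = ItempropText('xxx')
parser.feed(html)
parser.close()
print(parser.results)
```

The <br/> is dropped and the text on both sides of it is kept, which is the case the original [^<]* pattern could not handle.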

Upvotes: 2

Robᵩ

Reputation: 168876

1) Never use regular expressions to parse HTML.

2) The following regular expression will work some of the time, on some HTML:

#!/usr/bin/python2.7

import re

pattern = ur'''
    (?imsx)             # ignore case, multiline, dot-matches-newline, verbose
    <p.*?>              # match first marker
    (?P<text>.*?)       # non-greedy match anything
    </p.*?>             # match second marker
'''

print re.findall(pattern, '<p>hello</p>')
print re.findall(pattern, '<p>hello</p> and <p>goodbye</p>')
print re.findall(pattern, 'before <p>hello</p> and <p><i>good</i>bye</p> after')
print re.findall(pattern, '<p itemprop="xxx"> some text <br/> another text </p>')

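As another answer pointed out, .*? is the non-greedy pattern: it matches any character, as few times as possible, so each match ends at the nearest closing marker. The s flag is what lets . match newlines as well.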
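To make the difference concrete, here is a small Python 3 sketch (so plain r'' strings instead of ur'') that compares a greedy match, a lazy match, a lazy match without re.DOTALL, and a negative-lookahead pattern that spells out "everything except </p>" literally. The sample string and the variable names are invented for the demonstration, and like any regex on HTML it will break on nested or malformed markup.

```python
import re

# Two paragraphs; the first one spans two lines.
html = '<p itemprop="xxx"> some text <br/>\n another text </p><p>other</p>'

# Greedy: .* runs to the LAST </p>, swallowing the second paragraph too.
greedy = re.search(r'<p\b[^>]*>(?P<text>.*)</p>', html, re.DOTALL)

# Lazy: .*? stops at the FIRST </p>.
lazy = re.search(r'<p\b[^>]*>(?P<text>.*?)</p>', html, re.DOTALL)

# Without re.DOTALL, "." does not match "\n", so the two-line paragraph
# cannot match and the search falls through to the second paragraph.
no_dotall = re.search(r'<p\b[^>]*>(?P<text>.*?)</p>', html)

# "Any character, as long as </p> does not start here": a negative
# lookahead checked before each character is consumed.
tempered = re.search(r'<p\b[^>]*>(?P<text>(?:(?!</p>).)*)</p>', html, re.DOTALL)

for name, m in [('greedy', greedy), ('lazy', lazy),
                ('no_dotall', no_dotall), ('tempered', tempered)]:
    print(name, repr(m.group('text')))
```

The lazy and lookahead versions capture the same text here; the lookahead form is mainly useful when the pattern after the group is more complex than a fixed closing tag.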

Upvotes: 1
