Reputation: 1407
I'm trying to write a Python parser to extract some information from HTML pages.
It should extract the text between <p itemprop="xxx"> and </p>.
I use this regular expression:
m = re.search(ur'p>(?P<text>[^<]*)</p>', html)
but it fails when there are other tags between them. For example:
<p itemprop="xxx"> some text <br/> another text </p>
As I understand it, [^<] excludes only a single character. How do I write "everything except </p>"?
Upvotes: 2
Views: 2403
Reputation: 31524
You can use:
m = re.search(ur'p>(?P<text>.*?)</p>', html)
This is a lazy match: it consumes as little as possible, so it stops at the first </p>. Note that . does not match newlines unless you pass re.DOTALL. You should also consider using an HTML parser such as BeautifulSoup which, once installed, can be used with CSS selectors like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a warning
m = soup.select('p[itemprop="xxx"]')  # list of matching <p> tags
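If installing BeautifulSoup is not an option, the same parser-based idea can be sketched with the standard library's html.parser module. This is a stand-in for the BeautifulSoup call above, not an equivalent of it: the class ItempropTextExtractor and the function extract_itemprop_text are hypothetical names made up for this illustration, and the sketch assumes Python 3 and that <p> elements are not nested.

```python
from html.parser import HTMLParser


class ItempropTextExtractor(HTMLParser):
    """Collect the text inside every <p itemprop="..."> element (illustrative helper)."""

    def __init__(self, itemprop):
        super().__init__()
        self.itemprop = itemprop
        self.results = []     # one cleaned string per matching <p>
        self._chunks = None   # text pieces of the <p> currently open, else None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "p" and dict(attrs).get("itemprop") == self.itemprop:
            self._chunks = []

    def handle_data(self, data):
        # Only keep text while we are inside a matching <p>
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._chunks is not None:
            # Join the pieces and collapse runs of whitespace
            self.results.append(" ".join("".join(self._chunks).split()))
            self._chunks = None


def extract_itemprop_text(html, itemprop):
    parser = ItempropTextExtractor(itemprop)
    parser.feed(html)
    parser.close()
    return parser.results


sample = '<p itemprop="xxx"> some text <br/> another text </p>'
print(extract_itemprop_text(sample, "xxx"))
```

Inner tags such as <br/> are simply skipped, so the text on either side of them ends up in one string.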
Upvotes: 2
Reputation: 168876
1) Never use regular expressions to parse HTML.
2) The following regular expression will work some of the time, on some HTML:
#!/usr/bin/python2.7
import re
pattern = ur'''
(?imsx) # ignore case, multiline, dot-matches-newline, verbose
<p.*?> # match first marker
(?P<text>.*?) # non-greedy match anything
</p.*?> # match second marker
'''
print re.findall(pattern, '<p>hello</p>')
print re.findall(pattern, '<p>hello</p> and <p>goodbye</p>')
print re.findall(pattern, 'before <p>hello</p> and <p><i>good</i>bye</p> after')
print re.findall(pattern, '<p itemprop="xxx"> some text <br/> another text </p>')
As another answer pointed out, .*? is the non-greedy pattern; with the s (dot-matches-newline) flag set, it matches any character, including newlines.
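For readers on Python 3, where the ur'' prefix and the print statement are no longer valid, a rough port of the script above might look like the sketch below. It is not the original pattern verbatim: the flags are passed as arguments instead of an inline group, the multiline flag is dropped because the pattern has no ^ or $, and the tag patterns are slightly tightened so that <pre> is not mistaken for <p>. The name P_TEXT is made up for this example, and the caveat about regular expressions and HTML still applies.

```python
import re

# Flags are passed as arguments instead of an inline (?imsx) group
P_TEXT = re.compile(
    r"""
    <p\b[^>]*>        # opening tag, with or without attributes
    (?P<text>.*?)     # lazy: as little as possible, inner tags included
    </p\s*>           # closing tag
    """,
    re.IGNORECASE | re.DOTALL | re.VERBOSE,
)

samples = [
    '<p>hello</p>',
    '<p>hello</p> and <p>goodbye</p>',
    'before <p>hello</p> and <p><i>good</i>bye</p> after',
    '<p itemprop="xxx"> some text <br/> another text </p>',
]

for s in samples:
    print(P_TEXT.findall(s))

# For contrast, the character-class pattern from the question gives up
# as soon as it meets the "<" of the inner <br/> tag.
question_pattern = re.compile(r'p>(?P<text>[^<]*)</p>')
print(question_pattern.search(samples[3]))
```

The last line prints None, which is the failure described in the question; the lazy pattern returns the whole inner content, <br/> included, so any inner tags still have to be stripped afterwards.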
Upvotes: 1