Reputation: 1147
I need to quickly extract text from HTML files. I am using the following regular expressions instead of a full-fledged parser since I need to be fast rather than accurate (I have more than a terabyte of text). The profiler shows that most of the time in my script is spent in the re.sub procedure. What are good ways of speeding up my process? I can implement some portions in C, but I wonder whether that will help, given that the time is spent inside re.sub, which I assume is already implemented efficiently.
# Remove scripts, styles, tags, entities, and extraneous spaces:
import re

scriptRx = re.compile(r"<script.*?/script>", re.I)
styleRx = re.compile(r"<style.*?/style>", re.I)
tagsRx = re.compile(r"<[!/]?[a-zA-Z-]+[^<>]*>")
entitiesRx = re.compile(r"&[0-9a-zA-Z]+;")
spacesRx = re.compile(r"\s{2,}")
....
text = scriptRx.sub(" ", text)
text = styleRx.sub(" ", text)
....
Thanks!
Upvotes: 4
Views: 7453
Reputation: 5565
I would use a simple program with plain Python's str.partition, something like this, but it has only been tested with one example file containing a style tag:
## simple filtering, assuming no discarded tags are nested inside other discarded tags
start_tags = ('<style', '<script')
end_tags = ('</style>', '</script>')

## print("input:\n%s" % open('giant.html').read())
out = open('cleaned.html', 'w')
end_tag = ''

for line in open('giant.html'):
    line = ' '.join(line.split())  # collapse runs of whitespace
    if end_tag:
        if end_tag in line:
            # keep only the text after the closing tag
            _, tag, end = line.partition(end_tag)
            if end.strip():
                out.write(end)
            end_tag = ''
        continue  ## discard rest of line if no end tag found in line
    found = [index for index, start_tag in enumerate(start_tags)
             if start_tag in line]
    for index in found:
        start, tag, end = line.partition(start_tags[index])
        if start.strip():
            out.write(start)  # keep any text before the start tag
        # drop until closing angle bracket of start tag
        tag, _, end = end.partition('>')
        # check if closing tag already in same line
        if end_tags[index] in end:
            _, tag, end = end.partition(end_tags[index])
            if end.strip():
                out.write(end)
            end_tag = ''  # end tag reset after found
        else:
            end_tag = end_tags[index]  # no end tag in same line
    if not found and not end_tag:
        out.write(line + '\n')
out.close()
## print('result:\n%s' % open('cleaned.html').read())
Upvotes: 0
Reputation: 75272
You're processing each file five times, so the first thing you should do (as Paul Sanwald said) is try to reduce that number by combining your regexes. I would also avoid using reluctant quantifiers, which are designed for convenience at the expense of efficiency. Consider this regex:
<script.*?</script>
Each time the . goes to consume another character, it first has to make sure </script> won't match at that spot. It's almost like doing a negative lookahead at every position:
<script(?:(?!</script>).)*</script>
But we know there's no point doing the lookahead if the next character is anything but <, and we can tailor the regex accordingly:
<script[^<]*(?:<(?!/script>)[^<]*)*</script>
When I test them in RegexBuddy with this target string:
<script type="text/javascript">var imagePath='http://sstatic.net/stackoverflow/img/';</script>
...the reluctant regex takes 173 steps to make the match, while the tailored regex takes only 28.
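RegexBuddy step counts don't translate directly to the Python re engine, so if you want to sanity-check the difference in Python itself, here's a rough timeit sketch (the variable names are mine):

import re, timeit

target = ('<script type="text/javascript">'
          "var imagePath='http://sstatic.net/stackoverflow/img/';</script>")

reluctant = re.compile(r"<script.*?</script>", re.I)
tailored = re.compile(r"<script[^<]*(?:<(?!/script>)[^<]*)*</script>", re.I)

# rough relative timing; absolute numbers depend on machine and input
print(timeit.timeit(lambda: reluctant.search(target), number=100000))
print(timeit.timeit(lambda: tailored.search(target), number=100000))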
Combining your first three regexes into one yields this beast:
<(?:(script|style)[^<]*(?:<(?!/\1)[^<]*)*</\1>|[!/]?[a-zA-Z-]+[^<>]*>)
You might want to zap the <HEAD> element while you're at it (i.e., (script|style|head)).
I don't know what you're doing with the fourth regex, for character entities; are you just deleting those, too? I'm guessing the fifth regex has to be run separately, since some of the whitespace it's cleaning up is generated by the earlier steps. But try it with the first three regexes combined and see how much difference it makes. That should tell you whether it's worth going forward with this approach.
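To see it in context, here's a minimal sketch of the combined pass in Python (the names combinedRx and strip_html are mine, and your whitespace regex is kept as a separate final step):

import re

combinedRx = re.compile(
    r"<(?:(script|style)[^<]*(?:<(?!/\1)[^<]*)*</\1>|[!/]?[a-zA-Z-]+[^<>]*>)",
    re.I)
spacesRx = re.compile(r"\s{2,}")

def strip_html(text):
    text = combinedRx.sub(" ", text)  # scripts, styles, and tags in one pass
    return spacesRx.sub(" ", text)    # whitespace cleanup still runs last

print(strip_html('<p>Hello<script>var x=1;</script> world</p>'))
# -> ' Hello world '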
Upvotes: 5
Reputation: 7736
First, use an HTML parser built for this, like BeautifulSoup:
http://www.crummy.com/software/BeautifulSoup/
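A minimal sketch, assuming BeautifulSoup 4 with the stdlib html.parser backend (the function name is mine):

from bs4 import BeautifulSoup

def extract_text(html):
    soup = BeautifulSoup(html, 'html.parser')
    # drop script and style elements before pulling out the text
    for element in soup(['script', 'style']):
        element.decompose()
    return ' '.join(soup.get_text().split())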
Then, you can identify any remaining slow spots with the profiler:
http://docs.python.org/library/profile.html
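For example (process_file here is a hypothetical stand-in for your per-file routine):

import cProfile

# sort by cumulative time so the dominant calls appear first
cProfile.run("process_file('sample.html')", sort='cumtime')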
And for learning about regular expressions, I've found Mastering Regular Expressions very valuable, no matter what the programming language:
http://oreilly.com/catalog/9781565922570
Also:
How can I debug a regular expression in python?
Given the reclarification of the use-case, though, the above is not what you want for this request. My alternate recommendation would be: Speeding up regular expressions in Python
Upvotes: 8
Reputation: 7736
If your use-case is indeed to parse a few things for each of millions of documents, then my above answer won't help. I recommend some heuristics, like doing a couple of "straight text" regexes on them to begin with, like just plain /script/ and /style/, to throw things out quickly if you can. In fact, do you really need to do the end-tag check at all? Isn't <style good enough? Leave validation for someone else. If the quick ones succeed, then put the rest into a single regex, like /<script|<style|\s{2,}|etc.../, so that it doesn't have to go through so much text once for each regex.
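A hypothetical sketch of that idea (the function name and the exact patterns are my own guesses, not tested against the asker's data):

import re

heavyRx = re.compile(r"<script.*?</script>|<style.*?</style>", re.I | re.S)
tagsRx = re.compile(r"<[!/]?[a-zA-Z-]+[^<>]*>")
spacesRx = re.compile(r"\s{2,}")

def clean(text):
    # plain substring checks are far cheaper than running every regex
    if '<script' in text or '<style' in text:
        text = heavyRx.sub(' ', text)
    text = tagsRx.sub(' ', text)
    return spacesRx.sub(' ', text)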
Upvotes: 1
Reputation: 131800
The suggestion to use an HTML parser is a good one, since it'll quite possibly be faster than regular expressions. But I'm not sure BeautifulSoup is the right tool for the job, since it constructs a parse tree from the entire file and stores the whole thing in memory. For a terabyte of HTML, you'd need an obscene amount of RAM to do that ;-) I'd suggest you look at HTMLParser, which is written at a lower level than BeautifulSoup, but I believe it's a stream parser, so it will only load a bit of the text at a time.
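A minimal sketch of that streaming approach, assuming Python 3's html.parser module (in Python 2 it was named HTMLParser; the class design here is mine):

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip = 0      # nesting depth inside <script>/<style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.chunks.append(data)

parser = TextExtractor()
parser.feed('<p>Hello<script>var x=1;</script> world</p>')
print(' '.join(''.join(parser.chunks).split()))  # -> Hello world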
Upvotes: 0
Reputation: 11349
One thing you can do is combine the script/style regexes using backreferences. Here's some sample data:
$ cat sample
<script>some stuff</script>
<html>whatever </html>
<style>some other stuff</style>
Using Perl (single quotes keep the shell from expanding $1):
perl -ne 'if (/<(script|style)>.*?<\/\1>/) { print "$1\n"; }' sample
It will match either script or style. I second the recommendation for "Mastering Regular Expressions"; it's an excellent book.
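If you'd rather stay in Python, the same backreference trick looks like this (a sketch; the variable name is mine):

import re

comboRx = re.compile(r"<(script|style)>.*?</\1>")
for line in open('sample'):
    m = comboRx.search(line)
    if m:
        print(m.group(1))  # prints which tag matched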
Upvotes: 1