Reputation: 955
I'm using a Raspberry Pi 1B+ w/ Debian Linux:
Linux rbian 3.18.0-trunk-rpi #1 PREEMPT Debian 3.18.5-1~exp1+rpi16 (2015-03-28) armv6l GNU/Linux
As part of a larger Python program I'm using this code:
#!/usr/bin/env python
import time
from urllib2 import Request, urlopen
from bs4 import BeautifulSoup
_url="http://xml.buienradar.nl/"
s1 = time.time()
req = Request(_url)
print "Request = {0}".format(time.time() - s1)
s2 = time.time()
response = urlopen(req)
print "URLopen = {0}".format(time.time() - s2)
s3 = time.time()
output = response.read()
print "Read = {0}".format(time.time() - s3)
s4 = time.time()
soup = BeautifulSoup(output)
print "Soup (1) = {0}".format(time.time() - s4)
s5 = time.time()
MSwind = str(soup.buienradarnl.weergegevens.actueel_weer.weerstations.find(id=6350).windsnelheidms)
GRwind = str(soup.buienradarnl.weergegevens.actueel_weer.weerstations.find(id=6350).windrichtinggr)
ms = MSwind.replace("<"," ").replace(">"," ").split()[1]
gr = GRwind.replace("<"," ").replace(">"," ").split()[1]
print "Extracting info = {0}".format(time.time() - s5)
s6 = time.time()
soup = BeautifulSoup(urlopen(_url))
print "Soup (2) = {0}".format(time.time() - s6)
s5 = time.time()
MSwind = str(soup.buienradarnl.weergegevens.actueel_weer.weerstations.find(id=6350).windsnelheidms)
GRwind = str(soup.buienradarnl.weergegevens.actueel_weer.weerstations.find(id=6350).windrichtinggr)
ms = MSwind.replace("<"," ").replace(">"," ").split()[1]
gr = GRwind.replace("<"," ").replace(">"," ").split()[1]
print "Extracting info = {0}".format(time.time() - s5)
When I run it, I get this output:
Request = 0.00394511222839
URLopen = 0.0579500198364
Read = 0.0346400737762
Soup (1) = 23.6777830124
Extracting info = 0.183892965317
Soup (2) = 36.6107468605
Extracting info = 0.382317781448
So, the BeautifulSoup command takes about half a minute to process the _url.
I would really love it if this could be done in under 10 seconds.
Any suggestions that would significantly speed up the code (by at least 60%) would be extremely welcome.
Upvotes: 3
Views: 9590
Reputation: 43495
Using requests and regular expressions can be a lot shorter and faster. For relatively simple data gathering like this, regexes work fine.
#!/usr/bin/env python
from __future__ import print_function
import re
import requests
import time
_url = "http://xml.buienradar.nl/"
_regex = '<weerstation id="6391">.*?'\
         '<windsnelheidMS>(.*?)</windsnelheidMS>.*?'\
         '<windrichtingGR>(.*?)</windrichtingGR>'
s1 = time.time()
br = requests.get(_url)
print("Request = {0}".format(time.time() - s1))
s5 = time.time()
MSwind, GRwind = re.findall(_regex, br.text)[0]
print("Extracting info = {0}".format(time.time() - s5))
print('wind speed', MSwind, 'm/s')
print('wind direction', GRwind, 'degrees')
On my desktop (which is not a Raspberry Pi, though :-) ) this runs very fast:
Request = 0.0723416805267334
Extracting info = 0.0009412765502929688
wind speed 2.35 m/s
wind direction 232.6 degrees
Of course this particular regex would fail if the windsnelheidMS and windrichtingGR tags were reversed. But given that the XML is most probably computer-generated, that doesn't seem likely.
And there is a solution for that: first use a regex to capture the text between <weerstation id="6391"> and </weerstation>, and then use two other regexes to find the wind speed and direction, as sketched below.
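A minimal sketch of that two-stage approach (the station id and URL are taken from the code above; re.DOTALL is added on the assumption that the station block spans multiple lines in the pretty-printed XML):
from __future__ import print_function
import re
import requests

_url = "http://xml.buienradar.nl/"
text = requests.get(_url).text

# Stage 1: isolate the block for one station.
# re.DOTALL lets .*? cross line breaks.
station = re.search(r'<weerstation id="6391">(.*?)</weerstation>',
                    text, re.DOTALL).group(1)

# Stage 2: search the block for each tag independently,
# so the order of the tags inside the block no longer matters.
MSwind = re.search(r'<windsnelheidMS>(.*?)</windsnelheidMS>', station).group(1)
GRwind = re.search(r'<windrichtingGR>(.*?)</windrichtingGR>', station).group(1)

print('wind speed', MSwind, 'm/s')
print('wind direction', GRwind, 'degrees')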
Upvotes: 0
Reputation: 1121744
Install the lxml library; once installed, BeautifulSoup will use it as the default parser. lxml parses the page using the libxml2 C library, which is significantly faster than the default html.parser backend, which is implemented in pure Python.
You can then also parse the page as XML instead of as HTML:
soup = BeautifulSoup(output, 'xml')
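If you would rather be explicit than rely on auto-detection, you can name the parser directly; both of these require lxml to be installed:
soup = BeautifulSoup(output, 'lxml')      # lxml's HTML parser
soup = BeautifulSoup(output, 'lxml-xml')  # lxml's XML parser ('xml' is an alias)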
Parsing your given page with lxml should be faster; I can parse the page almost 50 times per second:
>>> timeit("BeautifulSoup(output, 'xml')", 'from __main__ import BeautifulSoup, output', number=50)
1.1700470447540283
Still, I wonder if you are missing some other Python acceleration libraries, as I certainly cannot reproduce your results even with the built-in parser:
>>> timeit("BeautifulSoup(output, 'html.parser')", 'from __main__ import BeautifulSoup, output', number=50)
1.7218239307403564
Perhaps you are memory constrained and the large-ish document causes your OS to swap memory a lot? Memory swapping (writing pages to disk and loading other pages from disk) can bring even the fastest programs to a grinding halt.
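If you want to verify that hypothesis, one quick, Linux-specific check is to look at /proc/meminfo before and after the parse (the helper below is just a diagnostic sketch; output and BeautifulSoup are assumed to come from your script):
from __future__ import print_function

def meminfo(*keys):
    # Pull selected fields from /proc/meminfo, e.g. MemFree and SwapFree.
    with open('/proc/meminfo') as f:
        info = dict(line.split(':', 1) for line in f)
    return dict((k, info[k].strip()) for k in keys)

print(meminfo('MemFree', 'SwapTotal', 'SwapFree'))  # before parsing
soup = BeautifulSoup(output, 'xml')
print(meminfo('MemFree', 'SwapTotal', 'SwapFree'))  # after parsing
If SwapFree drops noticeably while the parse runs, swapping is the likely culprit.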
Note that instead of using str() on tag elements and splitting off the tags, you can get the value from a tag simply by using the .string attribute:
station_6350 = soup.buienradarnl.weergegevens.actueel_weer.weerstations.find(id=6350)
ms = station_6350.windsnelheidMS.string
gr = station_6350.windrichtingGR.string
If you are using the XML parser, take into account that tagnames must match case (HTML is a case-insensitive mark-up language).
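For example (a quick illustration, reusing output from your script):
soup = BeautifulSoup(output, 'xml')
station = soup.find('weerstation', id='6350')
print(station.windsnelheidms)   # None; the XML tree has no all-lowercase tag
print(station.windsnelheidMS)   # finds the <windsnelheidMS> element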
Since this is an XML document, another option would be to use the lxml ElementTree model; you can use XPath expressions to extract the data:
from urllib2 import urlopen
from lxml import etree

_url = "http://xml.buienradar.nl/"
response = urlopen(_url)
for event, elem in etree.iterparse(response, tag='weerstation'):
    if elem.get('id') == '6350':
        ms = elem.find('windsnelheidMS').text
        gr = elem.find('windrichtingGR').text
        break
    # clear elements we are not interested in, adapted from
    # http://stackoverflow.com/questions/12160418/why-is-lxml-etree-iterparse-eating-up-all-my-memory
    elem.clear()
    for ancestor in elem.xpath('ancestor-or-self::*'):
        while ancestor.getprevious() is not None:
            del ancestor.getparent()[0]
This should only build the minimal object tree required, clearing out the weather stations you don't need as you go along the document.
Demo:
>>> from lxml import etree
>>> from urllib2 import urlopen
>>> _url = "http://xml.buienradar.nl/"
>>> response = urlopen(_url)
>>> for event, elem in etree.iterparse(response, tag='weerstation'):
...     if elem.get('id') == '6350':
...         ms = elem.find('windsnelheidMS').text
...         gr = elem.find('windrichtingGR').text
...         break
...     # clear elements we are not interested in
...     elem.clear()
...     for ancestor in elem.xpath('ancestor-or-self::*'):
...         while ancestor.getprevious() is not None:
...             del ancestor.getparent()[0]
...
>>> ms
'4.64'
>>> gr
'337.8'
Upvotes: 8