Reputation: 1518
This has been bugging me for a while now: I cannot use regular expressions to find a string with BeautifulSoup, and I have no idea why.
This is the line I'm having trouble with:
data = soup.find(text=re.compile('Överförda data (skickade/mottagna) [GB/GB]:')).findNext('td').contents[0]
Here is the whole code if needed:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
from bs4 import BeautifulSoup
import re
import urllib2
# Fetch URL
url = 'http://192.168.1.254/cgi/b/bb/?be=0&l0=1&l1=-1'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')
# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)
soup = BeautifulSoup(response)
time = soup.find(text="Aktiv tid:").findNext('td').contents[0]
data = soup.find(text=re.compile('Överförda data (skickade/mottagna) [GB/GB]:')).findNext('td').contents[0] # complains about this line
f=open('/var/www/log.txt', 'a')
print(time + ";" + data,file=f)
f.close()
Whenever I run it, I get an AttributeError: 'NoneType' object has no attribute 'findNext'
Because my string can be either:
so I need to use regular expressions to see whether it matches either of these.
Thank you in advance!
(EDIT: I have now changed my code (see the answer below), but it still gives me the same error:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
from bs4 import BeautifulSoup
import re
import urllib2
# Fetch URL
url = 'http://192.168.1.254/cgi/b/bb/?be=0&l0=1&l1=-1'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')
# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)
soup = BeautifulSoup(response)
time = soup.find(text="Aktiv tid:").findNext('td').contents[0]
data = soup.find(text=re.compile(re.escape(u'Överförda data (skickade/mottagna) [GB/GB]:'))).findNext('td').contents[0]
f=open('/var/www/log.txt', 'a')
print(time + ";" + data,file=f)
f.close()
Here is the relevant part of the HTML file:
<table width='100%' class='datatable' cellspacing='0' cellpadding='0'>
<tr>
<td>
</td>
<td width='30px'>
</td>
<td width='220px'>
</td>
<td width='50px'>
</td>
</tr>
<tr>
<td height='7' colspan='4'>
<img src='/images/spacer.gif' width='1' height='7' border='0' alt=''>
</td>
</tr>
<tr>
<td width='170'>
Aktiv tid: <!--This is a string I will search for.-->
</td>
<td colspan='3'>
1 dag, 17:03:46 <!--This is a piece of information I need to obtain.-->
</td>
</tr>
<tr>
<td height='7' colspan='4'>
<img src='/images/spacer.gif' width='1' height='7' border='0' alt=''>
</td>
</tr>
<tr>
<td width='170'>
Bandbredd (upp/ned) [kbps/kbps]:
</td>
<td colspan='3'>
1.058 / 21.373
</td>
</tr>
<tr>
<td height='7' colspan='4'>
<img src='/images/spacer.gif' width='1' height='7' border='0' alt=''>
</td>
</tr>
<tr>
<td width='170'>
Överförda data (skickade/mottagna) [GB/GB]: <!--This is another string I will search for.-->
</td>
<td colspan='3'>
1,67 / 42,95 <!--This is another piece of information I need to obtain.-->
</td>
</tr>
</table>
)
Upvotes: 0
Views: 334
Reputation: 1121186
BeautifulSoup operates on Unicode strings, but you passed in a byte string regex instead. Use a Unicode literal for your expression:
re.compile(re.escape(u'Överförda data (skickade/mottagna) [GB/GB]:'))
I also used re.escape() to escape the meta characters (parentheses and square brackets) so that they are not interpreted as regular expression syntax.
A byte string pattern contains the UTF-8 encoding of Ö and ö, and will only match that exact byte sequence, not the Unicode code points in the parsed text:
>>> 'Överförda'
'\xc3\x96verf\xc3\xb6rda'
>>> u'Överförda'
u'\xd6verf\xf6rda'
>>> print u'Överförda'
Överförda
>>> import re
>>> re.search('Överförda', u'Överförda data (skickade/mottagna) [GB/GB]')
>>> re.search(u'Överförda', u'Överförda data (skickade/mottagna) [GB/GB]')
<_sre.SRE_Match object at 0x107d47ed0>
This does require that you make a proper source code encoding declaration at the top of your file; see PEP 263.
Upvotes: 2
Reputation: 20163
Square brackets and parentheses are special in regular expressions. You need to escape them with a backslash if you want to match those literal characters (vs. defining capture groups, character classes, etc).
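To illustrate, here is a minimal sketch using the label from the question. The `cell_text` value is only an assumed stand-in for the text node the parser would hand to the regex, not actual output from the router page.

```python
# -*- coding: utf-8 -*-
import re

# The label from the question, as a Unicode literal.
label = u'Överförda data (skickade/mottagna) [GB/GB]:'

# Assumed stand-in for the text of the table cell (with surrounding whitespace).
cell_text = u'\n  Överförda data (skickade/mottagna) [GB/GB]:  \n'

# Unescaped: (...) becomes a capture group and [GB/GB] a character class,
# so the literal parentheses and brackets in the text are never matched.
unescaped = re.compile(label)

# Escaped: every meta character is matched literally.
escaped = re.compile(re.escape(label))

print(unescaped.search(cell_text))            # None
print(escaped.search(cell_text) is not None)  # True
```

Escaping by hand with backslashes works as well; re.escape() simply avoids missing a character.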
Upvotes: 1