Linus

Reputation: 1518

Python Beautifulsoup not finding regular expression

This has been bugging me for a while now: I cannot use a regular expression to find a string with BeautifulSoup, and I have no idea why.

This is the line I'm having trouble with:

data = soup.find(text=re.compile('Överförda data (skickade/mottagna) 

Here is the whole code if needed:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
from bs4 import BeautifulSoup

import re
import urllib2

# Fetch URL
url = 'http://192.168.1.254/cgi/b/bb/?be=0&l0=1&l1=-1'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')

# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)

soup = BeautifulSoup(response)

time = soup.find(text="Aktiv tid:").findNext('td').contents[0]
data = soup.find(text=re.compile('Överförda data (skickade/mottagna) [GB/GB]:')).findNext('td').contents[0] # complains about this line

f=open('/var/www/log.txt', 'a')
print(time + ";" + data,file=f)
f.close()

Whenever I run it, I get an AttributeError: 'NoneType' object has no attribute 'findNext'

Because my string can be either:

so I need to use regular expressions to see whether it matches either of these.

Thank you in advance!

(EDIT: I have now changed my code (see the answer below), but it still gives me the same error:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
from bs4 import BeautifulSoup

import re
import urllib2

# Fetch URL
url = 'http://192.168.1.254/cgi/b/bb/?be=0&l0=1&l1=-1'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')

# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)

soup = BeautifulSoup(response)

time = soup.find(text="Aktiv tid:").findNext('td').contents[0]
data = soup.find(text=re.compile(re.escape(u'Överförda data (skickade/mottagna) [GB/GB]:'))).findNext('td').contents[0]

f=open('/var/www/log.txt', 'a')
print(time + ";" + data,file=f)
f.close()

Here is the relevant part of the HTML file:

<table width='100%' class='datatable' cellspacing='0' cellpadding='0'>
  <tr>
    <td>
    </td>
    <td width='30px'>
    </td>
    <td width='220px'>
    </td>
    <td width='50px'>
    </td>
  </tr>
  <tr>
    <td height='7' colspan='4'>
      <img src='/images/spacer.gif' width='1' height='7' border='0' alt=''>
    </td>
  </tr>
  <tr>
    <td width='170'>
      Aktiv tid: <!--This is a string I will search for.-->
    </td>
    <td colspan='3'>
      1 dag, 17:03:46 <!--This is a piece of information I need to obtain.-->
    </td>
  </tr>
  <tr>
    <td height='7' colspan='4'>
      <img src='/images/spacer.gif' width='1' height='7' border='0' alt=''>
    </td>
  </tr>
  <tr>
    <td width='170'>
      Bandbredd (upp/ned) [kbps/kbps]:
    </td>
    <td colspan='3'>
      1.058 / 21.373
    </td>
  </tr>
  <tr>
    <td height='7' colspan='4'>
      <img src='/images/spacer.gif' width='1' height='7' border='0' alt=''>
    </td>
  </tr>
  <tr>
    <td width='170'>
      Överförda data (skickade/mottagna) [GB/GB]: <!--This is another string I will search for.-->
    </td>
    <td colspan='3'>
      1,67 / 42,95 <!--This is another piece of information I need to obtain.-->
    </td>
  </tr>
</table>

)

Upvotes: 0

Views: 334

Answers (2)

Martijn Pieters

Reputation: 1121186

BeautifulSoup operates on unicode strings, but you passed in a bytestring regex instead. Use a Unicode literal for your expression:

re.compile(re.escape(u'Överförda data (skickade/mottagna) [GB/GB]:'))

I also used re.escape() so that the metacharacters (parentheses and square brackets) are matched literally instead of being interpreted as regular expression syntax.

A bytestring pattern contains the UTF-8 encoding of Ö and ö, so it will only match that exact byte sequence, not the Unicode characters themselves:
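To see what re.escape() changes, here is a minimal sketch using only the re module. It assumes Python 3 (where every string literal is already Unicode), and cell_text is a hypothetical stand-in for the text node BeautifulSoup would hand to the pattern, not output captured from the real page.

```python
import re

label = "Överförda data (skickade/mottagna) [GB/GB]:"

# Hypothetical stand-in for the <td> text, including surrounding whitespace.
cell_text = "\n      Överförda data (skickade/mottagna) [GB/GB]: \n"

# Unescaped: "(...)" becomes a capture group and "[GB/GB]" a character class,
# so the pattern no longer describes the literal label.
unescaped = re.compile(label)

# Escaped: every metacharacter in the label is matched literally.
escaped = re.compile(re.escape(label))

unescaped_match = unescaped.search(cell_text)
escaped_match = escaped.search(cell_text)

print("unescaped pattern matched:", unescaped_match is not None)
print("escaped pattern matched:", escaped_match is not None)
```

Because search() is used rather than an exact comparison, the leading and trailing whitespace in the cell does not get in the way.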

>>> 'Överförda'
'\xc3\x96verf\xc3\xb6rda'
>>> u'Överförda'
u'\xd6verf\xf6rda'
>>> print u'Överförda'
Överförda
>>> import re
>>> re.search('Överförda', u'Överförda data (skickade/mottagna) [GB/GB]')
>>> re.search(u'Överförda', u'Överförda data (skickade/mottagna) [GB/GB]')
<_sre.SRE_Match object at 0x107d47ed0>

This does require a proper source code encoding declaration at the top of your file; see PEP 263.
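The interpreter session above is Python 2. As a rough illustration of the same distinction, the sketch below assumes Python 3, where the two types are called str and bytes and the re module refuses to mix them rather than silently finding nothing. The variable names are only illustrative.

```python
import re

# Written with escapes so the exact code points are unambiguous.
text = "\u00d6verf\u00f6rda"  # Överförda

# The UTF-8 encoding stores each of the two non-ASCII letters as two bytes.
encoded = text.encode("utf-8")

code_points = len(text)
byte_count = len(encoded)

# Python 3 raises TypeError when a bytes pattern is applied to a str.
try:
    re.search(encoded, text)
    mixed_types_rejected = False
except TypeError:
    mixed_types_rejected = True

# A str pattern against str text matches as expected.
same_type_match = re.search(text, text + " data")

print(code_points, byte_count, mixed_types_rejected)
```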

Upvotes: 2

nobody

Reputation: 20163

Square brackets and parentheses are special in regular expressions. You need to escape them with a backslash if you want to match those literal characters (as opposed to defining capture groups, character classes, etc.).
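Escaping by hand is useful when part of the label should stay a real pattern. The question does not show which variants the label can take, so the sketch below assumes, purely for illustration, that only the unit prefix differs (for example MB instead of GB). It assumes Python 3, and the sample strings are hypothetical.

```python
import re

# Literal parentheses and brackets are backslash-escaped; the inner
# [A-Z] classes are left unescaped on purpose to allow any unit prefix.
pattern = re.compile(r"Överförda data \(skickade/mottagna\) \[[A-Z]B/[A-Z]B\]:")

gb_match = pattern.search("Överförda data (skickade/mottagna) [GB/GB]:")
mb_match = pattern.search("Överförda data (skickade/mottagna) [MB/MB]:")
other_match = pattern.search("Bandbredd (upp/ned) [kbps/kbps]:")

print(gb_match is not None, mb_match is not None, other_match is not None)
```

A raw string (the r prefix) keeps the backslashes intact so they reach the regex engine unchanged.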

Upvotes: 1
