Python BeautifulSoup, parsing XML

Question

I would like to extract just the SQL error line:

SQL Error: failed

However, all of the text from the msg tag is printed:

Test: 01
SQL Error: failed


Test: 01
SQL Error: failed

file.xml



Test: 01
SQL Error: failed




Test: 01
SQL Error: failed

Code:

import re
from BeautifulSoup import BeautifulStoneSoup

file = "file.xml"

with open(file, 'r') as f:
    fobj = f.read()
    bobj = BeautifulStoneSoup(fobj)
    pattern = re.compile('SQL Error')
    for error in bobj.findAll('msg', text=pattern):
        print error

alecxe · Accepted Answer

This is the expected behavior: find_all() returns Tag instances, so printing one prints the whole element. Even if you print error.text, you get the complete text of the msg element:

Test: 01
SQL Error: failed

Assuming you want to extract the failed part, here is what you can do:

pattern = re.compile(r'SQL Error: (\w+)')
for error in bobj.find_all("msg", text=pattern):
    print(pattern.search(error.text).group(1))

Here a capturing group, (\w+), saves the run of word characters (letters, digits, or underscores) that follows the SQL Error: text.
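If you want the whole line rather than just the word after the colon, the same idea works on the plain text of the element. Here is a minimal sketch using only the re module; msg_text is a stand-in for what error.text returns, copied from the output in the question:

```python
import re

# Stand-in for error.text, as shown in the question's output.
msg_text = "Test: 01\nSQL Error: failed\n"

# re.MULTILINE makes ^ and $ match at line boundaries, so this grabs
# only the line that starts with "SQL Error:".
line_pattern = re.compile(r"^SQL Error: .*$", re.MULTILINE)
line = line_pattern.search(msg_text).group(0)

# The capturing group keeps only the word after "SQL Error: ".
word_pattern = re.compile(r"SQL Error: (\w+)")
status = word_pattern.search(msg_text).group(1)

print(line)
print(status)
```

group(0) is the entire match and group(1) is the first parenthesized group, so you can pick whichever granularity you need.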

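Also, you should definitely upgrade to BeautifulSoup 4 (the find_all() name used above is the BeautifulSoup 4 spelling; BeautifulSoup 3 calls it findAll()):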

pip install beautifulsoup4

And then import it as:

from bs4 import BeautifulSoup
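Putting it together with BeautifulSoup 4, something along these lines should work. This is a sketch, not a drop-in: the sample markup is hypothetical (only the msg tag name and its text come from the question), and it uses the built-in "html.parser" so that no extra parser needs to be installed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document; the real file.xml layout is not known.
xml = """<root>
<msg>Test: 01
SQL Error: failed
</msg>
<msg>Test: 02
All good
</msg>
</root>"""

soup = BeautifulSoup(xml, "html.parser")
pattern = re.compile(r"SQL Error: (\w+)")

results = []
for msg in soup.find_all("msg"):
    # get_text() returns the full text of the element; the regex then
    # narrows it down to the part we care about.
    match = pattern.search(msg.get_text())
    if match:
        results.append(match.group(1))

print(results)
```

Filtering inside the loop avoids depending on the text= argument of find_all(), which newer BeautifulSoup 4 releases spell string=.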
