Reputation: 7260
I would like to extract just the SQL error line:
SQL Error: failed
However, all of the text from the msg tag is printed:
Test: 01
SQL Error: failed
Test: 01
SQL Error: failed
file.xml
<item>
<msg>
Test: 01
SQL Error: failed
</msg>
</item>
<item>
<msg>
Test: 01
SQL Error: failed
</msg>
</item>
Code:
import re
from BeautifulSoup import BeautifulStoneSoup

file = "file.xml"
with open(file, 'r') as f:
    fobj = f.read()

bobj = BeautifulStoneSoup(fobj)
pattern = re.compile('SQL Error')
for error in bobj.findAll('msg', text=pattern):
    print error
Upvotes: 1
Reputation: 9038
Using BeautifulSoup 4 you can change
print error
to
print error.get_text().strip().split("\n")[1]
error is a tag, so you first get its string value with get_text(), then strip off the leading and trailing newlines with strip(). You then split the text into a list with one entry per line; the value you want is on the second line, so you access it with [1].
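The chain above can be tried on a plain string, without BeautifulSoup. This is a minimal sketch: msg_text is an assumed sample of what get_text() would return for one msg element, and the helper names second_line and sql_error_line are hypothetical. The second helper is a variant that does not rely on the error being on the second line.

```python
# Assumed sample: the text get_text() would return for one <msg> element.
msg_text = "\nTest: 01\nSQL Error: failed\n"


def second_line(text):
    """Positional approach: strip outer whitespace, split on newlines, take index 1."""
    return text.strip().split("\n")[1]


def sql_error_line(text):
    """Order-independent variant: first line starting with 'SQL Error', else None."""
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith("SQL Error"):
            return line
    return None


print(second_line(msg_text))
print(sql_error_line(msg_text))
```

The positional version raises IndexError if the message has only one line, so the variant that searches for the prefix may be the safer choice when the message layout can vary.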
Upvotes: 1
Reputation: 473863
This is the expected behavior: find_all() returns Tag instances, so printing one prints the whole tag. Even if you print error.text, you get the complete text of the msg element:
Test: 01
SQL Error: failed
Assuming you want to extract the failed part, here is what you can do:
pattern = re.compile(r'SQL Error: (\w+)')
for error in bobj.find_all("msg", text=pattern):
    print pattern.search(error.text).group(1)
Here we use a capturing group to save the one or more word characters (\w+: letters, digits, or underscores) that follow the SQL Error: text.
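To see what the capturing group returns on its own, here is a small standard-library sketch. The sample string and the helper name extract_status are assumptions for illustration; group(0) is the whole match and group(1) is only the captured word.

```python
import re

# Raw string so that \w reaches the regex engine unchanged.
pattern = re.compile(r"SQL Error: (\w+)")

# Assumed sample: the text of one <msg> element.
msg_text = "\nTest: 01\nSQL Error: failed\n"


def extract_status(text):
    """Return the word captured after 'SQL Error: ', or None if there is no match."""
    match = pattern.search(text)
    return match.group(1) if match else None


match = pattern.search(msg_text)
print(match.group(0))  # whole match, including the 'SQL Error: ' prefix
print(match.group(1))  # only the captured group
print(extract_status("Test: 01"))  # no match, so None
```

Checking for None before calling group() avoids an AttributeError on messages that contain no SQL error.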
Also, you should definitely upgrade to BeautifulSoup 4:
pip install beautifulsoup4
And then import it as:
from bs4 import BeautifulSoup
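Putting the pieces together, a Python 3 sketch with BeautifulSoup 4 might look like the following. It assumes the built-in "html.parser" backend (chosen here only to avoid an extra dependency such as lxml), inlines the sample from file.xml as a string, and uses a hypothetical helper name, extract_errors.

```python
import re
from bs4 import BeautifulSoup

# Sample content of file.xml from the question, inlined for a self-contained run.
XML = """<item>
<msg>
Test: 01
SQL Error: failed
</msg>
</item>
<item>
<msg>
Test: 01
SQL Error: failed
</msg>
</item>"""

pattern = re.compile(r"SQL Error: (\w+)")


def extract_errors(markup):
    """Collect the word after 'SQL Error: ' from every <msg> element that has one."""
    # Assumption: html.parser is sufficient for this simple markup.
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for msg in soup.find_all("msg"):
        match = pattern.search(msg.get_text())
        if match:
            results.append(match.group(1))
    return results


print(extract_errors(XML))
```

Filtering with the regex after get_text(), rather than inside find_all(), keeps the tag lookup and the text extraction as two separate steps, which can be easier to debug.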
Upvotes: 1