Regis Santos

Reputation: 3749

How to parse with xml.etree in Python?

Python 3.5

Here is the code:

import urllib.request
from xml.etree import ElementTree as ET

url = 'http://www.sat.gob.mx/informacion_fiscal/tablas_indicadores/Paginas/tipo_cambio.aspx'


def conectar(url):
    page = urllib.request.urlopen(url)
    return page.read()

root = ET.fromstring(conectar(url))
s = root.findall("//*[contains(.,'21/')]")

I need to extract the elements containing '21/', but the code returns this error:

Error:

Traceback (most recent call last):
  File "crawler.py", line 11, in <module>
    root = ET.fromstring(conectar(url))
  File "/home/rg3915/.pyenv/versions/3.5.0/lib/python3.5/xml/etree/ElementTree.py", line 1321, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: unbound prefix: line 146, column 8

But I do not know how to solve this error.

Upvotes: 0

Views: 353

Answers (2)

DavinirJr

Reputation: 26

You could start with:

import urllib2
from bs4 import BeautifulSoup

url = 'http://www.sat.gob.mx/informacion_fiscal/tablas_indicadores/Paginas/tipo_cambio.aspx'
response = urllib2.urlopen(url)
html = response.read()
dom = BeautifulSoup(html, 'html.parser')

tables = dom.find_all("table")
if len(tables):
    table = tables[0]
    print table

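(Tested in Python 2.7. On Python 3, use urllib.request.urlopen instead of urllib2.urlopen, and print(table) instead of print table.)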

Upvotes: 1

Gary van der Merwe

Reputation: 9523

While the document you are trying to parse claims to be XHTML, it is invalid XML because it contains an unbound namespace prefix:

<gcse:search></gcse:search>

The gcse namespace prefix is never declared in the document, so a strict XML parser rejects it.
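To see the error in isolation, here is a minimal sketch that reproduces it without touching the network. The two fragments and the namespace URI urn:example:gcse are made up for illustration, and try_parse is a hypothetical helper; the real page is much larger, but the failure mode should be the same.

```python
from xml.etree import ElementTree as ET

# Made-up fragments standing in for the real page.
# UNBOUND uses the gcse prefix without declaring it; BOUND declares it.
UNBOUND = "<html><body><gcse:search></gcse:search></body></html>"
BOUND = ('<html xmlns:gcse="urn:example:gcse">'
         "<body><gcse:search></gcse:search></body></html>")


def try_parse(text):
    """Return the root element, or the ParseError message if parsing fails."""
    try:
        return ET.fromstring(text)
    except ET.ParseError as exc:
        return str(exc)


# The undeclared prefix makes the parser give up with "unbound prefix".
print(try_parse(UNBOUND))

# Once the prefix is declared, the same markup parses, and the tag is
# reported in ElementTree's {namespace}localname form.
root = try_parse(BOUND)
print(root[0][0].tag)
```

Since you do not control the page, you cannot add the missing declaration yourself, which is why a strict XML parser is a poor fit here.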

BeautifulSoup would probably be much better suited for what you are trying to do, because it is not fussy about the document being 100% valid.
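The same tolerance can be shown without extra dependencies using the standard library's html.parser, which is the backend the other answer passes to BeautifulSoup. This is only a sketch: the fragment and its values are made up, and TextFinder is a hypothetical class, not part of any library. BeautifulSoup gives you a much friendlier API on top of this.

```python
from html.parser import HTMLParser

# Made-up fragment: the same undeclared gcse prefix, plus a table
# containing a date that starts with "21/". Values are illustrative only.
SAMPLE = ("<html><body><gcse:search></gcse:search>"
          "<table><tr><td>21/10/2015</td><td>0.0000</td></tr></table>"
          "</body></html>")


class TextFinder(HTMLParser):
    """Collect text nodes that contain a given substring."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.matches = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        if self.needle in data:
            self.matches.append(data.strip())


finder = TextFinder("21/")
finder.feed(SAMPLE)   # no exception, despite the undeclared prefix
finder.close()
print(finder.matches)
```

The HTML parser treats gcse:search as just another tag name, so the undeclared prefix never becomes an error.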

Upvotes: 1
