ExperimentsWithCode

Reputation: 1184

Using lxml to process html from requests. TypeError: can't pickle _ElementUnicodeResult objects

I am trying to extract data located at a specific XPath on a page. I can fetch the page with requests, and I have verified that it is the correct page by printing r.text and comparing the output with the text I am looking for.

r.text returns one long string, which makes it hard to pull out the information I want. I have been told that lxml is the way to go for searching by XPath. Unfortunately, I am getting a TypeError.

from lxml import html
import requests

payload = {'login_pass': 'password', 'login_user': 'username','submit':'go'}
r = requests.get("website", params=payload)

print r.encoding
tree = html.fromstring(r.text)
print tree
print tree.text_content()

This prints:

UTF-8
<Element html at 0x10dab8d08>

Traceback (most recent call last):
  File "/Users/Me/Documents/PYTHON/GetImageAsPdf/ImageToPDF_requests_beta.py", line 11, in <module>
    print tree.text_content()
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/idlelib/PyShell.py", line 1343, in write
    return self.shell.write(s, self.tags)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/idlelib/rpc.py", line 595, in __call__
    value = self.sockio.remotecall(self.oid, self.name, args, kwargs)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/idlelib/rpc.py", line 210, in remotecall
    seq = self.asynccall(oid, methodname, args, kwargs)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/idlelib/rpc.py", line 225, in asynccall
    self.putmessage((seq, request))
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/idlelib/rpc.py", line 324, in putmessage
    s = pickle.dumps(message)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/copy_reg.py", line 70, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle _ElementUnicodeResult objects

I also checked the response headers:

r.headers

returns

{'charset': 'utf-8',
 'x-powered-by': 'PHP/5.3.3',
 'transfer-encoding': 'chunked',
 'set-cookie': 'PHPSESSID=c6i7kph59nl9ocdlkckmjavas1; path=/, LOGIN_USER=deleted; expires=Tue, 15-Oct-2013 15:12:08 GMT; path=/',
 'expires': 'Thu, 19 Nov 1981 08:52:00 GMT',
 'server': 'Apache/2.2.15 (CentOS)',
 'connection': 'close',
 'pragma': 'no-cache',
 'cache-control': 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0',
 'date': 'Wed, 15 Oct 2014 15:12:09 GMT',
 'content-type': 'text/html; charset=UTF-8'}

My goal is to be able to search the tree via xpath like this:

quantity = tree.xpath('/html/body/form[1]/table[3]/tbody[1]/tr/td[2]/table/tbody/tr/td[1]/table/tbody/tr/td/table[1]/tbody/tr[1]/td[2]/strong')

Can you please help me identify where I am going wrong?

Upvotes: 4

Views: 858

Answers (1)

broox

Reputation: 3960

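The traceback comes from IDLE rather than from lxml itself: IDLE's shell passes printed output through its RPC layer, which pickles it (the pickle.dumps call in rpc.py), and text_content() returned an _ElementUnicodeResult, a unicode subclass that cannot be pickled. You should be able to convert the _ElementUnicodeResult object into a regular, picklable unicode string.

With Python 2, wrap it in unicode(), e.g. print unicode(tree.text_content())

With Python 3, wrap it in str(), e.g. print(str(tree.text_content()))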
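
As a minimal sketch of the conversion (Python 3 syntax): the HTML snippet and the variable names below are made up for illustration and stand in for r.text from the question. Depending on the lxml version, text_content() may already return a plain string, in which case the str() call is harmless.

```python
from lxml import html

# Hypothetical stand-in for r.text; the real page is not shown in the question.
page = "<html><body><p>Quantity: <strong>42</strong></p></body></html>"
tree = html.fromstring(page)

# text_content() may return an lxml "smart string" (_ElementUnicodeResult).
raw = tree.text_content()

# Converting to the built-in str type gives an ordinary string that can be
# pickled, and therefore printed from the IDLE shell.
plain = str(raw)
print(plain)

# XPath text() results can be smart strings too, so convert them the same way.
values = [str(t) for t in tree.xpath("//strong/text()")]
print(values)
```

The same idea should carry over to the results of the longer XPath in the question: convert whatever strings come back before printing them in IDLE.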

Upvotes: 4
