When I make a request to a URL using the Scrapy shell, I get back something like this:
In [6]: sel.xpath("//div[@class='my_class']").extract()
[u'<div class="my_class"><ul><li class="parent">\n<a href="/category/tractors-ride-on-mowers/">\n\u0422\u0420\u0410\u041a\u0422\u041e\u0420\u042b \u0438 \u0420\u0410\u0419\u0414\u0415\u0420\u042b</a>\n<div class="sub1"><div class="str"></div><ul><li><a href="/category/lawn-tractors/" class="">\u0421\u0430\u0434\u043e\u0432\u044b\u0435 \u0442\u0440\u0430\u043a\u0442\u043e\u0440\u04....
How can I convert that into a readable string?
Upvotes: 2
Views: 3574
Reputation: 20748
A few comments:
sel.xpath("//div[@class='my_class']") selects div elements.
sel.xpath("//div[@class='my_class']").extract() returns a list of strings, each one the HTML serialization of a selected element. When the shell displays that list, any non-ASCII characters in the text nodes are shown as \u escape sequences.
Alternatively, you can ask for the string representation of the selected node directly, using XPath's string() function:
sel.xpath("string(//div[@class='my_class'])").extract()
or use the common pattern of joining the text() nodes:
"".join(sel.xpath("//div[@class='my_class']//text()").extract())
Note that string() considers only the first element matching the expression passed as its argument. From the XPath 1.0 specification:
A node-set is converted to a string by returning the string-value of the node in the node-set that is first in document order.
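To make the first-node rule concrete, here is a minimal sketch that uses the standard library's xml.etree.ElementTree rather than a Scrapy selector. ElementTree has no string() function, so itertext() stands in for the string-value of a node. The two-div document and the variable names are hypothetical and only serve the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page: two divs share the class.
doc = ET.fromstring(
    '<body>'
    '<div class="my_class"><a href="/a/">ТРАКТОРЫ</a> <b>one</b></div>'
    '<div class="my_class"><a href="/b/">two</a></div>'
    '</body>'
)

matches = doc.findall(".//div[@class='my_class']")

# Analogue of string(//div[@class='my_class']):
# the string-value of the first match in document order, nothing else.
first_only = "".join(matches[0].itertext())

# Analogue of "".join(...//text()...): the text of every matching div.
all_text = "".join("".join(m.itertext()) for m in matches)

print(first_only)
print(all_text)
```

With two matching divs, first_only holds only the text of the first one, while all_text also includes the second. If you need a separate string per element with a Scrapy selector, looping over the selected divs and evaluating string(.) on each is the usual approach.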
Example scrapy shell session:
$ scrapy shell
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7f06700bc2d0>
[s] item {}
[s] settings <scrapy.settings.Settings object at 0x7f06700b6f10>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [1]: import scrapy
In [2]: sel = scrapy.Selector(text=u'''<div class="my_class"><ul><li class="parent">\n<a href="/category/tractors-ride-on-mowers/">\n\u0422\u0420\u0410\u041a\u0422\u041e\u0420\u042b \u0438 \u0420\u0410\u0419\u0414\u0415\u0420\u042b</a>\n<div class="sub1"><div class="str"></div><ul><li><a href="/category/lawn-tractors/" class="">\u0421\u0430\u0434\u043e\u0432\u044b\u0435 \u0442\u0440\u0430\u043a\u0442\u043e\u0440''')
In [3]: print "".join(sel.xpath('//div[@class="my_class"]//text()').extract())
ТРАКТОРЫ и РАЙДЕРЫ
Садовые трактор
In [4]: for r in sel.xpath('string(//div[@class="my_class"])').extract():
   ...:     print r
   ...:
ТРАКТОРЫ и РАЙДЕРЫ
Садовые трактор
In [5]:
Upvotes: 1
Reputation: 16037
Once you print it (or write it to a file), it will be readable:
>>> u = u'<div class="my_class"><ul><li class="parent">\n<a href="/category/tractors-ride-on-mowers/">\n\u0422\u0420\u0410\u041a\u0422\u041e\u0420\u042b \u0438 \u0420\u0410\u0419\u0414\u0415\u0420\u042b</a>\n<div class="sub1"><div class="str"></div><ul><li><a href="/category/lawn-tractors/" class="">\u0421\u0430\u0434\u043e\u0432\u044b\u0435 \u0442\u0440\u0430\u043a\u0442\u043e\u0440'
>>> print (u)
<div class="my_class"><ul><li class="parent">
<a href="/category/tractors-ride-on-mowers/">
ТРАКТОРЫ и РАЙДЕРЫ</a>
<div class="sub1"><div class="str"></div><ul><li><a href="/category/lawn-tractors/" class="">Садовые трактор
>>>
Upvotes: 5
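The point of the last answer, that the \u sequences are only how the string is displayed and not what it contains, can be checked with a short Python 3 sketch. The sessions above are Python 2, so this is an adaptation: ascii() is used here to reproduce the escaped display, and the variable names are illustrative.

```python
# The same word written two ways: with \u escapes and with literal characters.
escaped = "\u0422\u0420\u0410\u041a\u0422\u041e\u0420\u042b"
literal = "ТРАКТОРЫ"

# The escapes are only notation; both names hold the identical string.
same = escaped == literal

# ascii() of a list shows the escaped form, much like the output in the question.
shown = ascii([escaped])

# Encoding to UTF-8 yields the bytes a UTF-8 text file would contain.
raw = escaped.encode("utf-8")

print(same)
print(shown)
print(escaped)
```

The list display shows escapes, but printing the string itself shows the Cyrillic text, and the encoded bytes decode back to the same characters.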