Reputation: 533
I am using Scrapy to scrape information from a website. My XPath works, but it does not grab all of the content inside a blockquote.
Python code:
sel.xpath('//div[@class="content"]/div/blockquote/node()').extract()[0]
I am using this to grab the first blockquote on the page, but the result is cut off at the first <br>.
For example:
If the page contains this:
<blockquote class="postcontent restore ">
4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)
<br>
Operating System
<br>
Windows 8.1 64
<br>
Display
</blockquote>
It will only return:
4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)
But I would like it to return everything, including the HTML tags and the rest of the text in the blockquote.
Upvotes: 0
Views: 372
Reputation: 20748
//div[@class="content"]/div/blockquote/node() will get you all the nodes directly under each blockquote: both child text nodes and child element nodes. In your case, that means the text nodes and the <br> elements.
sel.xpath('//div[@class="content"]/div/blockquote/node()').extract()[0]
will extract only the first node, which is the text node containing "4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)".
Here's a sample ipython session to show different outputs using selectors:
$ ipython
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
Type "copyright", "credits" or "license" for more information.
IPython 1.2.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: import scrapy
In [2]: selector = scrapy.selector.Selector(text="""<blockquote class="postcontent restore ">
...: 4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)
...: <br>
...: Operating System
...: <br>
...: Windows 8.1 64
...: <br>
...: Display
...: </blockquote>""")
In [3]: selector.xpath('blockquote/node()').extract()
Out[3]: []
In [4]: selector.xpath('.//blockquote/node()').extract()
Out[4]:
[u'\n4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)\n',
u'<br>',
u'\nOperating System\n',
u'<br>',
u'\nWindows 8.1 64\n',
u'<br>',
u'\nDisplay\n']
In [5]: selector.xpath('.//blockquote').extract()
Out[5]: [u'<blockquote class="postcontent restore ">\n4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)\n<br>\nOperating System\n<br>\nWindows 8.1 64\n<br>\nDisplay\n</blockquote>']
In [6]: selector.xpath('string(.//blockquote)').extract()
Out[6]: [u'\n4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)\n\nOperating System\n\nWindows 8.1 64\n\nDisplay\n']
In [7]: selector.xpath('.//blockquote//text()').extract()
Out[7]:
[u'\n4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)\n',
u'\nOperating System\n',
u'\nWindows 8.1 64\n',
u'\nDisplay\n']
In [8]: "\n".join(selector.xpath('.//blockquote//text()').extract())
Out[8]: u'\n4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)\n\n\nOperating System\n\n\nWindows 8.1 64\n\n\nDisplay\n'
In [9]:
Following the OP's comment, a good fit would be (//div[@class="content"]/div/blockquote)[1]//text(), which selects every text node under the first matching blockquote.
Using the OP's original input page:
$ scrapy shell http://forums.redflagdeals.com/dominos-pizza-50-off-july-14th-20th-1505545/
2014-07-16 20:43:45+0200 [scrapy] INFO: Scrapy 0.24.2 started (bot: scrapybot)
2014-07-16 20:43:45+0200 [scrapy] INFO: Optional features available: ssl, http11, boto
2014-07-16 20:43:45+0200 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-07-16 20:43:45+0200 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-07-16 20:43:46+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-07-16 20:43:46+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-07-16 20:43:46+0200 [scrapy] INFO: Enabled item pipelines:
2014-07-16 20:43:46+0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-07-16 20:43:46+0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-07-16 20:43:46+0200 [default] INFO: Spider opened
2014-07-16 20:43:47+0200 [default] DEBUG: Crawled (200) <GET http://forums.redflagdeals.com/dominos-pizza-50-off-july-14th-20th-1505545/> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7f63775b0c10>
[s] item {}
[s] request <GET http://forums.redflagdeals.com/dominos-pizza-50-off-july-14th-20th-1505545/>
[s] response <200 http://forums.redflagdeals.com/dominos-pizza-50-off-july-14th-20th-1505545/>
[s] settings <scrapy.settings.Settings object at 0x7f6377c4fd90>
[s] spider <Spider 'default' at 0x7f6376d52bd0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [1]: response.xpath('//div[@class="content"]/div/blockquote')
Out[1]:
[<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>]
In [2]: response.xpath('(//div[@class="content"]/div/blockquote)[1]')
Out[2]: [<Selector xpath='(//div[@class="content"]/div/blockquote)[1]' data=u'<blockquote class="postcontent restore "'>]
In [3]: response.xpath('(//div[@class="content"]/div/blockquote)[1]//text()')
Out[3]:
[<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n\t\t\t\tGot a coupon that stated 50% off a'>,
<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\nCode is CAG5014'>,
<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\nDeal is on! '>,
<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u"Don't Forget to tip driver!!">,
<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n\t\t\t'>]
In [4]: response.xpath('string((//div[@class="content"]/div/blockquote)[1])').extract()
Out[4]: [u"\r\n\t\t\t\tGot a coupon that stated 50% off any pizza at menu price. \r\n\r\nCode is CAG5014\r\n\r\nDeal is on! \r\n\r\nDon't Forget to tip driver!!\r\n\r\n\r\n\t\t\t"]
In [5]: response.xpath('normalize-space((//div[@class="content"]/div/blockquote)[1])').extract()
Out[5]: [u"Got a coupon that stated 50% off any pizza at menu price. Code is CAG5014 Deal is on! Don't Forget to tip driver!!"]
In [6]:
Upvotes: 1