davegallant

xpath in python does not grab entire HTML block

I am using Scrapy to scrape information from a website. My XPath matches, but it does not return all of the content inside the block.

Python code:

sel.xpath('//div[@class="content"]/div/blockquote/node()').extract()[0]

I am using this to grab the first blockquote on the page, but the result is cut off at the first <br>.

For example:

If the page contains this:

<blockquote class="postcontent restore ">
4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)
<br>
Operating System
<br>
Windows 8.1 64
<br>
Display
</blockquote>

It will only return:

4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)

But I would prefer that it return everything, including the HTML tags and the rest of the text in the blockquote.

Answers (1)

paul trmbrth

//div[@class="content"]/div/blockquote/node() selects every node directly under each matching blockquote: both its child text nodes and its child element nodes.

In your case, that means the text nodes and the <br> elements.

sel.xpath('//div[@class="content"]/div/blockquote/node()').extract()[0] extracts only the first of those nodes, which is the text node containing "4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)".

Here's a sample ipython session to show different outputs using selectors:

$ ipython
Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
Type "copyright", "credits" or "license" for more information.

IPython 1.2.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import scrapy

In [2]: selector = scrapy.selector.Selector(text="""<blockquote class="postcontent restore ">
   ...: 4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)
   ...: <br>
   ...: Operating System
   ...: <br>
   ...: Windows 8.1 64
   ...: <br>
   ...: Display
   ...: </blockquote>""")

In [3]: selector.xpath('blockquote/node()').extract()
Out[3]: []

In [4]: selector.xpath('.//blockquote/node()').extract()
Out[4]: 
[u'\n4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)\n',
 u'<br>',
 u'\nOperating System\n',
 u'<br>',
 u'\nWindows 8.1 64\n',
 u'<br>',
 u'\nDisplay\n']

In [5]: selector.xpath('.//blockquote').extract()
Out[5]: [u'<blockquote class="postcontent restore ">\n4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)\n<br>\nOperating System\n<br>\nWindows 8.1 64\n<br>\nDisplay\n</blockquote>']

In [6]: selector.xpath('string(.//blockquote)').extract()
Out[6]: [u'\n4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)\n\nOperating System\n\nWindows 8.1 64\n\nDisplay\n']

In [7]: selector.xpath('.//blockquote//text()').extract()
Out[7]: 
[u'\n4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)\n',
 u'\nOperating System\n',
 u'\nWindows 8.1 64\n',
 u'\nDisplay\n']

In [8]: "\n".join(selector.xpath('.//blockquote//text()').extract())
Out[8]: u'\n4th Generation Intel Core i7-4710HQ Processor (2.50GHz 1600MHz 6MB)\n\n\nOperating System\n\n\nWindows 8.1 64\n\n\nDisplay\n'

In [9]: 

Following the OP's comment, a good fit would be (//div[@class="content"]/div/blockquote)[1]//text(), which selects every text node inside the first matching blockquote.

Using the OP's original input page:

$ scrapy shell http://forums.redflagdeals.com/dominos-pizza-50-off-july-14th-20th-1505545/
2014-07-16 20:43:45+0200 [scrapy] INFO: Scrapy 0.24.2 started (bot: scrapybot)
2014-07-16 20:43:45+0200 [scrapy] INFO: Optional features available: ssl, http11, boto
2014-07-16 20:43:45+0200 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-07-16 20:43:45+0200 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-07-16 20:43:46+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-07-16 20:43:46+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-07-16 20:43:46+0200 [scrapy] INFO: Enabled item pipelines: 
2014-07-16 20:43:46+0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-07-16 20:43:46+0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-07-16 20:43:46+0200 [default] INFO: Spider opened
2014-07-16 20:43:47+0200 [default] DEBUG: Crawled (200) <GET http://forums.redflagdeals.com/dominos-pizza-50-off-july-14th-20th-1505545/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f63775b0c10>
[s]   item       {}
[s]   request    <GET http://forums.redflagdeals.com/dominos-pizza-50-off-july-14th-20th-1505545/>
[s]   response   <200 http://forums.redflagdeals.com/dominos-pizza-50-off-july-14th-20th-1505545/>
[s]   settings   <scrapy.settings.Settings object at 0x7f6377c4fd90>
[s]   spider     <Spider 'default' at 0x7f6376d52bd0>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]: response.xpath('//div[@class="content"]/div/blockquote')
Out[1]: 
[<Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>,
 <Selector xpath='//div[@class="content"]/div/blockquote' data=u'<blockquote class="postcontent restore "'>]

In [2]: response.xpath('(//div[@class="content"]/div/blockquote)[1]')
Out[2]: [<Selector xpath='(//div[@class="content"]/div/blockquote)[1]' data=u'<blockquote class="postcontent restore "'>]

In [3]: response.xpath('(//div[@class="content"]/div/blockquote)[1]//text()')
Out[3]: 
[<Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n\t\t\t\tGot a coupon that stated 50% off a'>,
 <Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
 <Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\nCode is CAG5014'>,
 <Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
 <Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\nDeal is on! '>,
 <Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
 <Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
 <Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u"Don't Forget to tip driver!!">,
 <Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
 <Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n'>,
 <Selector xpath='(//div[@class="content"]/div/blockquote)[1]//text()' data=u'\r\n\t\t\t'>]

In [4]: response.xpath('string((//div[@class="content"]/div/blockquote)[1])').extract()
Out[4]: [u"\r\n\t\t\t\tGot a coupon that stated 50% off any pizza at menu price. \r\n\r\nCode is CAG5014\r\n\r\nDeal is on! \r\n\r\nDon't Forget to tip driver!!\r\n\r\n\r\n\t\t\t"]

In [5]: response.xpath('normalize-space((//div[@class="content"]/div/blockquote)[1])').extract()
Out[5]: [u"Got a coupon that stated 50% off any pizza at menu price. Code is CAG5014 Deal is on! Don't Forget to tip driver!!"]

In [6]: 
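If the line structure should be kept while the stray whitespace is dropped, the clean-up can also be done in Python after extraction. The sketch below is an illustration only: normalize_space is a hypothetical helper that mimics XPath's normalize-space() (collapsing runs of space, tab, CR and LF), applied to the string returned by string(...) in the session above.

```python
import re

# The raw string returned by string((...)[1]) in the session above.
raw = ("\r\n\t\t\t\tGot a coupon that stated 50% off any pizza at menu price. "
       "\r\n\r\nCode is CAG5014\r\n\r\nDeal is on! \r\n\r\n"
       "Don't Forget to tip driver!!\r\n\r\n\r\n\t\t\t")


def normalize_space(text):
    """Hypothetical helper: collapse whitespace runs to one space and trim,
    in the same spirit as XPath's normalize-space()."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")


# Everything on one line, like normalize-space() in XPath.
one_line = normalize_space(raw)

# Alternatively, keep one entry per non-empty line.
lines = [normalize_space(line) for line in raw.splitlines()]
lines = [line for line in lines if line]

print(one_line)
print(lines)
```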
