Reputation: 1620
This is probably a very conceptual question, and Stack Overflow has a wealth of resources on Scrapy and building XPaths, but I did not find anything that answers this specifically, so I am asking.
While building my XPath expressions for Scrapy (in Python) using Firebug and XPath Checker (independently), I see two different ways to build my XPaths. I know that for a particular HTML hierarchy there can be many possible XPaths that extract the elements of interest. I also understand that you may generate either an absolute or a relative XPath (in FirePath).
More specifically:
Sample use case: trying to scrape a page on eBay
scrapy shell http://www.ebay.com/sch/Coats-Jackets-/57988/i.html
-- Using XPath Checker -- [Works OK after removing tbody from the XPath]
XPath = id('ResultSetItems')/table/tbody/tr/td/div/div/div/div/div/h4/a/text()
hxs.select("id('ResultSetItems')/table/tr/td/div/div/div/div/div/h4/a/text()").extract()
-- Using relative path in FirePath -- [Works OK after removing tbody from the XPath]
XPath = .//*[@id='ResultSetItems']/table[1]/tbody/tr/td[1]/div/div/div/div/div[2]/h4/a/@href
hxs.select(".//*[@id='ResultSetItems']/table[1]/tr/td[1]/div/div/div/div/div[2]/h4/a/@href").extract()
-- Using absolute path in FirePath -- [Does not work, even after removing tbody from the XPath]
XPath = html/body/div[5]/div[2]/div[3]/div[1]/div/div/div[2]/div/div[6]/div/table[1]/tbody/tr/td[1]/div/div/div/div/div[2]/h4/a/@href
hxs.select("html/body/div[5]/div[2]/div[3]/div[1]/div/div/div[2]/div/div[6]/div/table[1]/tr/td[1]/div/div/div/div/div[2]/h4/a/@href").extract()
Note that I get a result only after I explicitly remove "tbody" from the XPath, and even that does not help for absolute paths generated via FirePath.
Q1: Why do I need to remove "tbody"? Besides tbody, are there other elements that Firefox inserts in the middle of the XPath and that I should remove before trying to fetch responses (using hxs.select) or building my item pipeline?
A possible explanation I found: "Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won’t be able to extract any data if you use <tbody> in your XPath expressions." Source: Firefox; see also: Parsing HTML with XPath, Python and Scrapy
Q2: When I take the absolute path from the FirePath pane, the query returns nothing even after removing tbody. Why is that?
Q3: Is there a best practice on which of the two, Firebug or XPath Checker, works better (that is, is more robust and consistent)? If yes, which one, and why?
Q4 (unrelated): Some people recommend disabling JavaScript in the browser while building your XPaths. Is this related, and is disabling JavaScript standard practice? What are the repercussions of not doing so while scraping (if any)?
Related: Xpath Table Within Table; Parsing HTML with XPath, Python and Scrapy
Upvotes: 2
Views: 2154
Reputation: 473983
Q1
Adding the tbody tag is the browser's way of following the HTML4 specification:
<!ELEMENT TABLE - -
(CAPTION?, (COL*|COLGROUP*), THEAD?, TFOOT?, TBODY+)>
<!ATTLIST TABLE -- table element --
%attrs; -- %coreattrs, %i18n, %events --
summary %Text; #IMPLIED -- purpose/structure for speech output--
width %Length; #IMPLIED -- table width --
border %Pixels; #IMPLIED -- controls frame width around table --
frame %TFrame; #IMPLIED -- which parts of frame to render --
rules %TRules; #IMPLIED -- rulings between rows and cols --
cellspacing %Length; #IMPLIED -- spacing between cells --
cellpadding %Length; #IMPLIED -- spacing within cells --
>
In other words, according to this content model a tr element cannot be a direct child of table. A browser inserts tbody when it sees that it is missing. HTML5, on the other hand, allows tr directly under table. Browsers just keep inserting tbody for backwards compatibility now.
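To see the effect without a browser, here is a minimal sketch. It uses the standard library's ElementTree as a stand-in for Scrapy's selector (both query the markup as served, without adding elements), and the snippet and its item titles are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup as a server might send it: no <tbody>.
RAW = """
<div id="ResultSetItems">
  <table>
    <tr><td><h4><a href="/itm/1">Wool coat</a></h4></td></tr>
    <tr><td><h4><a href="/itm/2">Rain jacket</a></h4></td></tr>
  </table>
</div>
"""

root = ET.fromstring(RAW)  # root is the <div id="ResultSetItems"> element

# Path as a browser tool would report it (the browser's DOM contains a tbody).
with_tbody = root.findall("./table/tbody/tr/td/h4/a")
# Same path with the tbody step removed, matching the markup as served.
without_tbody = root.findall("./table/tr/td/h4/a")

browser_style_titles = [a.text for a in with_tbody]
raw_markup_titles = [a.text for a in without_tbody]

print(browser_style_titles)
print(raw_markup_titles)
```

The first query finds nothing because no tbody element exists in the markup that was actually parsed; the second one matches both links.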
Usually tbody is the only element you have to take care of but, in theory, a browser can change, inject, or fix a page's HTML in order to make it work.
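If you copy many paths out of the browser, the tbody clean-up can be automated. This is only a sketch: strip_tbody is a hypothetical helper name, not part of Scrapy, and it assumes the served HTML really has no tbody (if the page author wrote one, keep it in the path).

```python
import re

# Matches a "/tbody" or "/tbody[n]" location step, but not names that merely
# start with "tbody" and not occurrences inside predicates.
_TBODY_STEP = re.compile(r"/tbody(?:\[\d+\])?(?=/|$)")


def strip_tbody(xpath: str) -> str:
    """Return the XPath with browser-inserted tbody steps removed."""
    return _TBODY_STEP.sub("", xpath)


browser_xpath = "id('ResultSetItems')/table/tbody/tr/td/div/div/div/div/div/h4/a/text()"
print(strip_tbody(browser_xpath))
```

The result should still be checked in the Scrapy shell, since tbody is only the most common difference between the browser's DOM and the served page.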
Q2
This particular eBay page uses JavaScript to load the content dynamically. Scrapy sees only one div inside the "Body" div (div[5]):
>>> hxs.select('//html/body/div[5]/div').extract()
[u'<div id="showDiagContainer">\r\n</div>']
The other divs are loaded via AJAX requests. This is a common problem with Scrapy, and you have to find a workaround for it. There are two options: simulate these AJAX calls in the spider, or switch to an in-browser tool like Selenium (or combine it with Scrapy).
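One consequence worth illustrating: positional steps such as div[5] are counted in the browser's DOM, so they shift whenever a script adds or removes siblings. The sketch below uses made-up markup and the standard library's ElementTree (not Scrapy itself) to show a positional path breaking on the served markup while an id-anchored path keeps working:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup as served: the banner <div> is absent because a
# script adds it only after the page has loaded in a browser.
SERVED = """
<body>
  <div id="header">Shop</div>
  <div id="results"><a href="/itm/1">Wool coat</a></div>
</body>
"""

# The same hypothetical page after the script has inserted the banner.
RENDERED = """
<body>
  <div id="banner">Sale</div>
  <div id="header">Shop</div>
  <div id="results"><a href="/itm/1">Wool coat</a></div>
</body>
"""

POSITIONAL = "./div[3]/a"           # counted in the rendered DOM
ANCHORED = ".//*[@id='results']/a"  # anchored on an id instead


def hrefs(markup, path):
    """Return the href of every element that path matches in markup."""
    return [a.get("href") for a in ET.fromstring(markup).findall(path)]


positional_on_rendered = hrefs(RENDERED, POSITIONAL)
positional_on_served = hrefs(SERVED, POSITIONAL)
anchored_on_served = hrefs(SERVED, ANCHORED)

print(positional_on_rendered, positional_on_served, anchored_on_served)
```

An id-anchored path only helps when the target element is present in the served HTML. If the element itself arrives via AJAX, no XPath will find it and one of the two options above is needed.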
Q3
This is really up to you. If you continue to use Scrapy, construct the XPath in XPath Checker or FirePath and then check it in the Scrapy shell. That last step is the important one.
Q4
I'll answer this way: disable JS and load this eBay page. Do you see anything you wanted to scrape?
Hope that helps.
Upvotes: 1