Reputation: 57451
I'm trying to scrape APKmirror.com download pages such as http://www.apkmirror.com/apk/shareit-technologies-co-ltd/shareit-connect-transfer/shareit-3-0-38_ww-release/shareit-3-0-38_ww-android-apk-download/ in a reliable manner.
I've started the Scrapy shell from the command line with
scrapy shell http://www.apkmirror.com/apk/shareit-technologies-co-ltd/shareit-connect-transfer/shareit-3-0-38_ww-release/shareit-3-0-38_ww-android-apk-download/
I'm currently trying to scrape the developer name, app name, and version name from the top navigation bar, which in this case are "SHAREit Technologies Co.Ltd", "SHAREit - Transfer & Share", and "3.0.38_ww", respectively.
So far I've come up with the following XPath expression for the developer name:
In [84]: response.xpath('//*[@class="site-header-contents"]//nav//a/text()').extract()[0]
Out[84]: u'SHAREit Technologies Co.Ltd'
For the app and version names I would replace [0] with [1] and [2], respectively. The problem is that using numerical indices is not considered good scraping practice.
Rather, I'd like to use the 'real' distinguishing feature between these links: the fact that their URLs contain different numbers of slashes (/). I would like to define a custom selector which matches the a/@href against a regular expression and, if it matches, returns the a/text(), but I wasn't able to figure out how to do this. (For example, the re method (https://doc.scrapy.org/en/0.10.3/topics/selectors.html#scrapy.selector.XPathSelector.re) seems to be usable as a substitute for extract(), but not to 'aid' the selection process.)
How can I select based on a custom function applied to the @hrefs?
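The slash-counting idea can be sketched with the standard library alone (no Scrapy needed): collect each a tag's href and text, then pick a link by the number of slashes in its href. The markup below is a simplified stand-in for the real breadcrumb HTML, not the actual page structure:

```python
# A hedged, stdlib-only sketch of selecting a link by a custom function
# applied to its href (here: counting slashes). The HTML is a made-up
# stand-in for the real APKMirror breadcrumb markup.
from html.parser import HTMLParser

class BreadcrumbParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []     # list of (href, text) pairs
        self._href = None   # href of the <a> we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None and data.strip():
            self.links.append((self._href, data.strip()))
            self._href = None

html = (
    '<a href="/apk/dev/">Developer</a>'
    '<a href="/apk/dev/app/">App</a>'
    '<a href="/apk/dev/app/version/">Version</a>'
)

p = BreadcrumbParser()
p.feed(html)

# Index the links by slash count: 3 slashes -> developer, 4 -> app, 5 -> version
by_slashes = {href.count("/"): text for href, text in p.links}
print(by_slashes[3])  # Developer
```

This keys the selection on a structural property of the URL rather than on document order, which is what the question is after.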
Upvotes: 0
Views: 1021
Reputation: 21436
First of all, it's not necessarily bad practice in this case, since you're extracting data from breadcrumbs. You can rely on the breadcrumb order always being the same: the first item is the company, the second is the product, and the last is the version. Pretty predictable!
Nonetheless, you might want to use more robust XPath indexing instead:
"//div/a[1]"
# would get first <a> node under <div>
"//div/a[last()]"
# would get last <a> node under <div>
However, to answer your question, there is the re:test XPath function (from the EXSLT regular-expressions extension, which Scrapy selectors support), which allows you to test a node against a regular expression.
Find a <div> node that has an <a> child whose href contains .com:
"//div[re:test(a/@href, '.+?\.com')]"
Find a <div> node whose text has a case-insensitive regex match:
"//div[re:test(.//text(), 'foo.bar', 'i')]"
Upvotes: 2