Kurt Peek

Reputation: 57451

In Scrapy, how to select based on matching the URL of a link to a regular expression

I'm trying to scrape APKmirror.com download pages such as http://www.apkmirror.com/apk/shareit-technologies-co-ltd/shareit-connect-transfer/shareit-3-0-38_ww-release/shareit-3-0-38_ww-android-apk-download/ in a reliable manner.

I've started the Scrapy shell from the command line with

scrapy shell http://www.apkmirror.com/apk/shareit-technologies-co-ltd/shareit-connect-transfer/shareit-3-0-38_ww-release/shareit-3-0-38_ww-android-apk-download/

I'm currently trying to scrape the developer name, app name, and version name from the top navigation bar:

[Screenshot: the breadcrumb navigation bar at the top of the APKMirror download page]

which in this case are "SHAREit Technologies Co.Ltd", "SHAREit - Transfer & Share", and "3.0.38_ww", respectively.

So far I've come up with the following XPath expression for the developer name:

In [84]: response.xpath('//*[@class="site-header-contents"]//nav//a/text()').extract()[0]
Out[84]: u'SHAREit Technologies Co.Ltd'

For the app and version names I would replace [0] with [1] and [2], respectively. The problem is that using numerical indices is not considered good scraping practice.

Rather, I'd like to use the 'real' distinguishing feature between these links: the fact that their URLs contain different numbers of slashes (/). I would like to define a custom selector which matches the a/@href against a regular expression and if it matches, returns the a/text(), but I wasn't able to figure out how to do this. (For example, the re method (https://doc.scrapy.org/en/0.10.3/topics/selectors.html#scrapy.selector.XPathSelector.re) seems to be usable as a substitute for extract(), but not to 'aid' the selection process).

How can I select based on a custom function applied to the @hrefs?

Upvotes: 0

Views: 1021

Answers (1)

Granitosaurus

Reputation: 21436

First of all, it's not necessarily bad practice in this case, since you're extracting data from breadcrumbs. You can rely on the breadcrumb order always being the same: the first item is the company, the second is the product, and the last is the version. Pretty predictable!
Nonetheless, you might want to use more reliable XPath indexing instead:

"//div/a[1]" 
# would get first <a> node under <div>
"//div/a[last()]"
# would get last <a> node under <div>

However, to answer your question: there is the re:test XPath function, which lets you test a string against a regular expression.

Find a <div> node that has an <a> child whose href contains .com:

"//div[re:test(a/@href, '.+?\.com')]"  

Find a <div> node whose text matches a case-insensitive regex:

"//div[re:test(.//text(), 'foo.bar', 'i')]"

Upvotes: 2
