Reputation: 125
I am trying to get Scrapy to parse the links on a page so it can scrape them. Unfortunately, the links on this page are wrapped in a JavaScript onclick function. I would like to use an SgmlLinkExtractor rule to extract the link, parse the JavaScript to build the URL, and pass it to callback='parse_item' if possible.
Here is an example of each link with the JS function:
<a onclick="window.open('page.asp?ProductID=3679','productwin','width=700,height=475,scrollbars,resizable,status');" href="#internalpagelink">Link Text</a>
I just need the link extractor to send http://domain.com/page.asp?ProductID=3679 to the parse_item callback.
How would I write CrawlSpider rules to do this?
If this is not possible, what would be the best way to parse all of the pages reachable through this style of JavaScript link from a defined set of start pages?
Thank you all.
Upvotes: 3
Views: 5333
Reputation: 20748
You can use the attrs parameter of SgmlLinkExtractor:
- attrs (list) – list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',)
and the process_value parameter from BaseSgmlLinkExtractor:
- process_value (callable) – a function which receives each value extracted from the tag and attributes scanned and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.
So you would write a parsing function for "onclick" attributes' values:
import re

def process_onclick(value):
    # Extract the first argument passed to window.open(...)
    m = re.search("window.open\('(.+?)'", value)
    if m:
        return m.group(1)
Let's check that regular expression:
>>> re.search("window.open\('(.+?)'",
... "window.open('page.asp?ProductID=3679','productwin','width=700,height=475,scrollbars,resizable,status');"
... ).group(1)
'page.asp?ProductID=3679'
>>>
And then use it in a Rule with SgmlLinkExtractor:
rules = (
    Rule(SgmlLinkExtractor(allow=(),
                           attrs=('onclick',),
                           process_value=process_onclick),
         callback='parse_item'),
)
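For context, here is a minimal sketch of how the pieces fit together in a complete CrawlSpider. It assumes an older Scrapy release where SgmlLinkExtractor is available; the spider name, allowed domain, and start URL are placeholders. The extractor resolves relative values like page.asp?ProductID=... against the page URL, so parse_item should receive absolute URLs.

import re

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


def process_onclick(value):
    # Pull the first window.open(...) argument out of the onclick value
    m = re.search("window.open\('(.+?)'", value)
    if m:
        return m.group(1)


class ProductSpider(CrawlSpider):
    name = "products"                      # placeholder name
    allowed_domains = ["domain.com"]       # placeholder domain
    start_urls = ["http://domain.com/"]    # placeholder start page

    rules = (
        Rule(SgmlLinkExtractor(attrs=('onclick',),
                               process_value=process_onclick),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # The link extractor has already resolved the relative URL,
        # so response.url is absolute here.
        self.log("Scraping %s" % response.url)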
Upvotes: 6
Reputation: 9816
Maybe BaseSpider is more appropriate than CrawlSpider here. You could extract the link you want to crawl, encapsulate it in a Request object, and then yield that Request object, like the following:
def parse_xxx(self, response):
    """do some work"""
    req_objs = []
    req_objs.append(Request(SOME_URL, SOME_ARGS))
    # Add more `Request` objects here
    for req in req_objs:
        yield req
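Applied to the onclick links from the question, that approach might look like the following minimal sketch. It assumes the older Scrapy API (BaseSpider, HtmlXPathSelector) and Python 2's urlparse module; the spider name and start URL are placeholders.

import re
import urlparse

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class ProductLinkSpider(BaseSpider):
    name = "productlinks"                  # placeholder name
    start_urls = ["http://domain.com/"]    # placeholder start page

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Read each anchor's onclick attribute and pull out the first
        # argument passed to window.open(...)
        for onclick in hxs.select("//a/@onclick").extract():
            m = re.search("window.open\('(.+?)'", onclick)
            if m:
                url = urlparse.urljoin(response.url, m.group(1))
                yield Request(url, callback=self.parse_item)

    def parse_item(self, response):
        """do some work"""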
Upvotes: 0