Reputation: 359
I'm 99% sure something is going on with my hxs.select on this website. I cannot extract anything. When I run the following code, I don't get any error feedback, and neither title nor link gets populated. Any help?
def parse(self, response):
    self.log("\n\n\n We got data! \n\n\n")
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@class=\'footer\']')
    items = []
    for site in sites:
        item = CarrierItem()
        item['title'] = site.select('.//a/text()').extract()
        item['link'] = site.select('.//a/@href').extract()
        items.append(item)
    return items
Is there a way I can debug this? I also tried to use the scrapy shell command with a URL, but when I input view(response) in the shell it simply returns True and a text file opens instead of my web browser.
>>> response.url
'https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp'
>>> hxs.select('//div')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'select'
>>> view(response)
True
>>> hxs.select('//body')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'select'
Upvotes: 4
Views: 3112
Reputation: 175
To use pdb to debug a Scrapy spider, you need to insert a debugging point and include some code to turn it on and off. To make this spider debuggable with pdb, add the following code:
# -*- coding: utf-8 -*-
import scrapy
import os
import pdb


class QuotesSpiderSpider(scrapy.Spider):
    name = 'simple'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def __init__(self):
        # Drop into the debugger only when SCRAPY_DEBUG is set to a non-zero value
        scrapyDebug = os.getenv("SCRAPY_DEBUG")
        if scrapyDebug and int(scrapyDebug):
            pdb.set_trace()

    def parse(self, response):
        quotes = response.xpath("//div[@class='quote']//span[@class='text']/text()").extract()
        yield {'quotes': quotes}
So running the spider normally doesn't invoke the debugger. If you have a bug you need to debug, you invoke the debugger like this:
SCRAPY_DEBUG=1 scrapy crawl simple
The debugger will start in the spider's __init__() method. You can then set breakpoints at the places in the code where you've had issues.
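As a rough sketch of what that looks like (the path and line numbers here are hypothetical), a session could go:

$ SCRAPY_DEBUG=1 scrapy crawl simple
> /path/to/spider.py(15)__init__()
-> pdb.set_trace()
(Pdb) b /path/to/spider.py:18
Breakpoint 1 at /path/to/spider.py:18
(Pdb) c

Here line 18 stands in for the first line of parse(); after c, the crawl runs until that breakpoint is hit.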
Upvotes: 0
Reputation: 596
First, locate your scrapy executable:

$ which scrapy
/Users/whatever/tutorial/tutorial/env/bin/scrapy

For me it was at /Users/whatever/tutorial/tutorial/env/bin/scrapy. Copy that path.
Then go to the debug tab in VSCode and click "Add configuration":
{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "args": ["crawl", "NAME_OF_SPIDER"],
            "type": "python",
            "request": "launch",
            "program": "PATH_TO_SCRAPY_FILE",
            "console": "integratedTerminal",
            "justMyCode": false
        }
    ]
}
In that template, replace NAME_OF_SPIDER with the name of your spider (in my case datasets) and PATH_TO_SCRAPY_FILE with the output you got in step 1 (in my case /Users/whatever/tutorial/tutorial/env/bin/scrapy).
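With that saved, you can set breakpoints in your spider code and launch the "Python: Current File" configuration; VSCode runs the scrapy executable under its debugger with crawl NAME_OF_SPIDER as arguments. If the crawl starts in the wrong directory, one optional tweak (assuming your scrapy.cfg sits at the workspace root) is to pin the working directory in the same configuration block:

"cwd": "${workspaceFolder}"

cwd is a standard launch.json attribute, and ${workspaceFolder} expands to the folder opened in VSCode.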
Upvotes: 2
Reputation: 11975
You can use pdb from the command line and add a breakpoint in your file, but it involves a few steps. (It may differ slightly for Windows.)
Locate your scrapy executable:
$ whereis scrapy
/usr/local/bin/scrapy
Call it as a Python script and start pdb:
$ python -m pdb /usr/local/bin/scrapy crawl quotes
Once in the debugger shell, open another shell instance and locate the path to your spider script (residing in your spider project):
$ realpath path/to/your/spider.py
/absolute/spider/file/path.py
This will output the absolute path. Copy it to your clipboard.
In the pdb shell, type:

b /absolute/spider/file/path.py:line_number

...where line_number is the desired point to break when debugging that file. Then type c in the debugger to continue execution. Now go do some PythonFu :)
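Once the breakpoint is hit, you can inspect live objects. Assuming you broke inside a parse() callback, where response is in scope, a hypothetical exchange (the output below is made up) might look like:

(Pdb) p response.url
'http://quotes.toscrape.com/'
(Pdb) p response.xpath('//title/text()').extract()
['Quotes to Scrape']

p is pdb's print command; another c resumes the crawl afterwards.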
Upvotes: 3
Reputation: 20748
Scrapy shell is a good tool for that indeed. And if your document has an XML stylesheet, it's probably an XML document, so you can use scrapy shell with xxs instead of hxs, as in this Scrapy documentation example about removing namespaces: http://doc.scrapy.org/en/latest/topics/selectors.html#removing-namespaces
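A quick sanity check in the shell might look like this (the feed URL and element name below are placeholders; remove_namespaces() is the helper the linked docs describe):

$ scrapy shell http://example.com/feed.xml
>>> xxs.remove_namespaces()
>>> xxs.select('//link').extract()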
When that doesn't work, I tend to go back to pure lxml.etree and dump the whole document's elements:
import lxml.etree
import lxml.html

class myspider(BaseSpider):
    ...
    def parse(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        # fromstring() returns the root element directly, so no getroot() is needed
        root = lxml.etree.fromstring(response.body)
        # or for broken XML docs:
        # root = lxml.etree.fromstring(response.body, parser=lxml.etree.XMLParser(recover=True))
        # or for HTML:
        # root = lxml.etree.fromstring(response.body, parser=lxml.html.HTMLParser())
        # and then look up which elements I can actually select
        print list(root.iter())  # this could be very big, but at least you see all that's inside: the element tags and namespaces
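If the full dump is too noisy, one lighter variation is to print just the tag names, which is usually enough to spot unexpected namespaces:

print [el.tag for el in root.iter()]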
Upvotes: 1