Reputation: 359
I'm 99% sure something is going on with my hxs.select on this website. I cannot extract anything. When I run the following code, I don't get any error feedback, and neither title nor link gets populated. Any help?
def parse(self, response):
    self.log("\n\n\n We got data! \n\n\n")
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@class=\'footer\']')
    items = []
    for site in sites:
        item = CarrierItem()
        item['title'] = site.select('.//a/text()').extract()
        item['link'] = site.select('.//a/@href').extract()
        items.append(item)
    return items
Is there a way I can debug this? I also tried to use the scrapy shell command with a URL, but when I input view(response) in the shell it simply returns True and a text file opens instead of my web browser.
>>> response.url
'https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp'
>>> hxs.select('//div')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'select'
>>> view(response)
True
>>> hxs.select('//body')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'select'
Upvotes: 4
Views: 3112
Reputation: 175
To use pdb to debug a Scrapy spider, you need to insert a debugging point and include some code to turn it on and off. To make this spider debuggable with pdb, add the following code:
# -*- coding: utf-8 -*-
import scrapy
import os
import pdb


class QuotesSpiderSpider(scrapy.Spider):
    name = 'simple'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def __init__(self):
        # Drop into the debugger only when SCRAPY_DEBUG is set to a non-zero value
        scrapyDebug = os.getenv("SCRAPY_DEBUG")
        if scrapyDebug and int(scrapyDebug):
            pdb.set_trace()

    def parse(self, response):
        quotes = response.xpath("//div[@class='quote']//span[@class='text']/text()").extract()
        yield {'quotes': quotes}
So running the spider normally doesn't invoke the debugger. If you have a bug you need to debug, you invoke the debugger like this:
SCRAPY_DEBUG=1 scrapy crawl simple
The debugger will start in the spider's __init__() method. You can then set breakpoints at the places in the code where you've had issues.
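As a rough sketch of what that looks like (the path and line numbers here are hypothetical), a session could go:

$ SCRAPY_DEBUG=1 scrapy crawl simple
> /path/to/spider.py(15)__init__()
-> pdb.set_trace()
(Pdb) b /path/to/spider.py:18
Breakpoint 1 at /path/to/spider.py:18
(Pdb) c

Here line 18 stands in for the first line of parse(); after c, the crawl runs until that breakpoint is hit.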
Upvotes: 0
Reputation: 596
First, locate your scrapy executable:

$ which scrapy
/Users/whatever/tutorial/tutorial/env/bin/scrapy

For me it was at /Users/whatever/tutorial/tutorial/env/bin/scrapy. Copy that path.
Then go to the debug tab in VSCode and click "Add configuration":
{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "args": ["crawl", "NAME_OF_SPIDER"],
            "type": "python",
            "request": "launch",
            "program": "PATH_TO_SCRAPY_FILE",
            "console": "integratedTerminal",
            "justMyCode": false
        }
    ]
}
In that template, replace NAME_OF_SPIDER with the name of your spider (in my case datasets) and PATH_TO_SCRAPY_FILE with the output you got in step 1 (in my case /Users/whatever/tutorial/tutorial/env/bin/scrapy).
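With that saved, you can set breakpoints in your spider code and launch the "Python: Current File" configuration; VSCode runs the scrapy executable under its debugger with crawl NAME_OF_SPIDER as arguments. If the crawl starts in the wrong directory, one optional tweak (assuming your scrapy.cfg sits at the workspace root) is to pin the working directory in the same configuration block:

"cwd": "${workspaceFolder}"

cwd is a standard launch.json attribute, and ${workspaceFolder} expands to the folder opened in VSCode.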
Upvotes: 2
Reputation: 11975
You can use pdb from the command line and add a breakpoint in your file, but it involves a few steps. (It may differ slightly for Windows.)
Locate your scrapy executable:
$ whereis scrapy
/usr/local/bin/scrapy
Call it as a Python script and start pdb:
$ python -m pdb /usr/local/bin/scrapy crawl quotes
Once in the debugger shell, open another shell instance and locate the path to your spider script (residing in your spider project):
$ realpath path/to/your/spider.py
/absolute/spider/file/path.py
This will output the absolute path. Copy it to your clipboard.
In the pdb shell, type:

b /absolute/spider/file/path.py:line_number

...where line_number is the desired point to break when debugging that file. Then type c in the debugger to continue execution. Now go do some PythonFu :)
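Once the breakpoint is hit, you can inspect live objects. Assuming you broke inside a parse() callback, where response is in scope, a hypothetical exchange (the output below is made up) might look like:

(Pdb) p response.url
'http://quotes.toscrape.com/'
(Pdb) p response.xpath('//title/text()').extract()
['Quotes to Scrape']

p is pdb's print command; another c resumes the crawl afterwards.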
Upvotes: 3
Reputation: 20748
Scrapy shell is a good tool for that indeed. And if your document has an XML stylesheet, it's probably an XML document, so you can use scrapy shell with xxs instead of hxs, as in this Scrapy documentation example about removing namespaces: http://doc.scrapy.org/en/latest/topics/selectors.html#removing-namespaces
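A quick sanity check in the shell might look like this (the feed URL and element name below are placeholders; remove_namespaces() is the helper the linked docs describe):

$ scrapy shell http://example.com/feed.xml
>>> xxs.remove_namespaces()
>>> xxs.select('//link').extract()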
When that doesn't work, I tend to go back to pure lxml.etree and dump the whole document's elements:
import lxml.etree
import lxml.html

class myspider(BaseSpider):
    ...
    def parse(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        # fromstring() returns the root element directly, so no getroot() is needed
        root = lxml.etree.fromstring(response.body)
        # or for broken XML docs:
        # root = lxml.etree.fromstring(response.body, parser=lxml.etree.XMLParser(recover=True))
        # or for HTML:
        # root = lxml.etree.fromstring(response.body, parser=lxml.html.HTMLParser())
        # and then look up which elements I can actually select
        print list(root.iter())  # this could be very big, but at least you see all that's inside: the element tags and namespaces
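If the full dump is too noisy, one lighter variation is to print just the tag names, which is usually enough to spot unexpected namespaces:

print [el.tag for el in root.iter()]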
Upvotes: 1