Reputation: 28809
I am working with Scrapy 0.20 and Python 2.7. I found that PyCharm has a good Python debugger, and I want to test my Scrapy spiders with it. Does anyone know how to do that?
I tried File -> Settings -> Project Structure -> Add content root, but I don't know what else I have to do.
Upvotes: 115
Views: 52659
Reputation: 175
I like to have the spider drop into pdb, based on a SCRAPY_DEBUG environment variable, when it is instantiated:
. . .
# include imports
import pdb
import traceback
. . .
# include an __init__ that attaches to pdb if SCRAPY_DEBUG is defined
class SourceDataSpider(scrapy.Spider):
    . . .
    def __init__(self):
        self.debug = 0
        self.date = os.getenv("DATE")
        if os.getenv("SCRAPY_DEBUG") and int(os.getenv("SCRAPY_DEBUG")):
            self.debug = int(os.getenv("SCRAPY_DEBUG"))
            pdb.set_trace()
    . . .
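With that in place, you opt into the debugger from the shell, for example (the spider name here is hypothetical):
SCRAPY_DEBUG=1 scrapy crawl sourcedata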
Upvotes: 0
Reputation: 342
Might be a bit late, but maybe it helps somebody:
With recent PyCharm versions it's actually pretty straightforward: you can call Scrapy directly; see the attached picture of the run configuration (from the Scrapy tutorial).
Tested with PyCharm 2022.1.4.
Upvotes: 4
Reputation: 59654
I am also using PyCharm, but I am not using its built-in debugging features.
For debugging I am using ipdb. I set up a keyboard shortcut that inserts import ipdb; ipdb.set_trace() on any line where I want a breakpoint.
Then I can type n to execute the next statement, s to step into a function, type any object name to see its value, alter the execution environment, and type c to continue execution.
This is very flexible, and it works in environments other than PyCharm, where you don't control the execution environment.
Just run pip install ipdb in your virtual environment and place import ipdb; ipdb.set_trace() on the line where you want execution to pause.
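For instance, a breakpoint inside a spider's parse method might look like this (the spider and URL are illustrative, not from my actual project):
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        import ipdb; ipdb.set_trace()  # execution pauses here in the console
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}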
UPDATE
You can also pip install pdbpp and use the standard import pdb; pdb.set_trace() instead of ipdb. PDB++ is nicer in my opinion.
Upvotes: 3
Reputation: 644
Extending @Rodrigo's version of the answer, I added this script so I can set the spider name from the run configuration instead of hard-coding it in the string.
import sys
from scrapy import cmdline
cmdline.execute(f"scrapy crawl {sys.argv[1]}".split())
Upvotes: 0
Reputation: 10666
I use this simple script:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
process.crawl('your_spider_name')
process.start()
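A variant with an explicit settings override can be handy while debugging; this sketch is my own addition, with LOG_LEVEL as just an example setting:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
settings.set("LOG_LEVEL", "DEBUG")  # example override; any Scrapy setting works

process = CrawlerProcess(settings)
process.crawl('your_spider_name')
process.start()
Note that get_project_settings() only picks up your project settings when the script runs from inside the project, i.e. somewhere below the directory containing scrapy.cfg.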
Upvotes: 2
Reputation: 1182
As of 2018.1 this became a lot easier. You can now select Module name in your project's Run/Debug Configuration. Set this to scrapy.cmdline and the Working directory to the root dir of the scrapy project (the one with settings.py in it).
Like so:
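The fields, in outline (the spider name and path are placeholders; the crawl command goes into the Parameters field):
Module name:       scrapy.cmdline
Parameters:        crawl your_spider_name
Working directory: /path/to/your/scrapy/project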
Now you can add breakpoints to debug your code.
Upvotes: 51
Reputation: 160
According to the documentation (https://doc.scrapy.org/en/latest/topics/practices.html):
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished
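To make that skeleton concrete, a minimal runnable version might look like this (the spider name and URL are illustrative):
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://quotes.toscrape.com"]  # illustrative target

    def parse(self, response):
        # set a breakpoint here and inspect `response` in the PyCharm debugger
        yield {"title": response.css("title::text").get()}

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()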
Upvotes: 3
Reputation: 801
IntelliJ IDEA also works.
Create main.py:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
from scrapy import cmdline

def main(name):
    if name:
        cmdline.execute(name.split())

if __name__ == '__main__':
    print('[*] beginning main thread')
    name = "scrapy crawl stack"
    #name = "scrapy crawl spa"
    main(name)
    print('[*] main thread exited')
    print('main stop====================================================')
Shown below:
Upvotes: 6
Reputation: 2735
The scrapy command is a Python script, which means you can start it from inside PyCharm.
When you examine the scrapy binary (which scrapy) you will notice that it is actually a Python script:
#!/usr/bin/python
from scrapy.cmdline import execute
execute()
This means that a command like scrapy crawl IcecatCrawler can also be executed like this: python /Library/Python/2.7/site-packages/scrapy/cmdline.py crawl IcecatCrawler
Try to find the scrapy.cmdline package. In my case the location was /Library/Python/2.7/site-packages/scrapy/cmdline.py.
Create a run/debug configuration inside PyCharm with that script as the script. Fill the script parameters with the scrapy command and spider name, in this case crawl IcecatCrawler.
Like this:
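In outline (the cmdline.py path is from my machine, and the working directory is typically the project root):
Script:            /Library/Python/2.7/site-packages/scrapy/cmdline.py
Script parameters: crawl IcecatCrawler
Working directory: /path/to/your/scrapy/project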
Put your breakpoints anywhere in your crawling code and it should work™.
Upvotes: 206
Reputation: 1092
I am running Scrapy in a virtualenv with Python 3.5.0, and setting the "script" parameter to /path_to_project_env/env/bin/scrapy solved the issue for me.
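In outline, the configuration then looks like this (the spider name is a placeholder):
Script:            /path_to_project_env/env/bin/scrapy
Script parameters: crawl your_spider_name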
Upvotes: 12
Reputation: 1638
To add a bit to the accepted answer: after almost an hour I found I had to select the correct Run Configuration from the dropdown list (near the center of the icon toolbar) and then click the Debug button to get it to work. Hope this helps!
Upvotes: 3
Reputation: 1528
You just need to do this.
Create a Python file in the crawler folder of your project. I used main.py.
Inside your main.py, put the code below.
from scrapy import cmdline
cmdline.execute("scrapy crawl spider".split())
Then you need to create a "Run Configuration" to run your main.py.
With that in place, if you put a breakpoint in your code, execution will stop there.
Upvotes: 125