Beka Tomashvili

Reputation: 2311

Scrapy crawler in Cron job

I want to execute my Scrapy crawler from a cron job.

I created a bash file, getdata.sh, in the folder where the Scrapy project and its spiders are located:

#!/bin/bash
cd /myfolder/crawlers/
scrapy crawl my_spider_name

My crontab looks like this; I want to execute it every 5 minutes:

 */5 * * * * sh /myfolder/crawlers/getdata.sh 

But it doesn't work. What's wrong? Where is my error?

When I execute the bash file from the terminal with sh /myfolder/crawlers/getdata.sh, it works fine.
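To see what cron itself is failing on, the job's output could be redirected to a log (a sketch; /tmp/getdata.log is just an example path):

    # capture stdout and stderr from the cron run for inspection
    */5 * * * * sh /myfolder/crawlers/getdata.sh >> /tmp/getdata.log 2>&1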

Upvotes: 25

Views: 18183

Answers (8)

Archer0730

Reputation: 1

I run my Scrapy spider on a Raspberry Pi running Debian 11 (bullseye). The following settings/workflow worked for me:

First cd to your project directory. Install Scrapy in a venv environment using:

python3 -m venv ./venv
source ./venv/bin/activate
pip install scrapy  # no sudo: with sudo, pip installs system-wide instead of into the venv

Create your spiders.

Create the shell file (getdata.sh), using full directory paths (including /home/username/...):

#!/bin/bash
#activate virtual environment
source "/full/path/to/project/venv/bin/activate"

#move to the project directory 
cd /full/path/to/project/

#start spider
scrapy crawl my_spider_name

Schedule the spider in crontab using the following line in crontab -e:

   */5 * * * * /full/path/to/shfile/getdata.sh 
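Since this crontab line runs getdata.sh directly (without a leading sh), the script needs execute permission; one way to add it, using the same path as above:

    chmod +x /full/path/to/shfile/getdata.sh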

Upvotes: 0

Oni

Reputation: 1153

Check where Scrapy is installed using the which scrapy command. In my case, Scrapy is installed in /usr/local/bin.
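For example (the output below assumes an installation in /usr/local/bin, as on my machine; yours may differ):

    $ which scrapy
    /usr/local/bin/scrapy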

Open crontab for editing using crontab -e and add the following (crontab assignments are not shell-expanded, so spell out PATH rather than using $PATH or export):

    PATH=/usr/local/bin:/usr/bin:/bin
    */5 * * * * cd /myfolder/path && scrapy crawl spider_name

It should work. Scrapy runs every 5 minutes.

Upvotes: 1

Nikulsinh

Reputation: 16

In my case Scrapy is in ~/.local/bin/scrapy. Give the proper path to the scrapy binary plus the spider name and it works perfectly:

0 0 * * * cd /home/user/scraper/Folder_of_scriper/ && /home/user/.local/bin/scrapy crawl "name" >> /home/user/scrapy.log 2>&1

The >> /home/user/scrapy.log 2>&1 part saves the output and errors to scrapy.log, so you can check whether the program worked.
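To check the latest run, the log can be tailed, e.g.:

    tail -n 50 /home/user/scrapy.log   # show the last 50 lines of the most recent crawl output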

Thank you.

Upvotes: 0

nottmey

Reputation: 333

For anyone who used pip3 (or similar) to install scrapy, here is a simple inline solution:

*/10 * * * * cd ~/project/path && ~/.local/bin/scrapy crawl something >> ~/crawl.log 2>&1

Replace:

*/10 * * * * with your cron pattern

~/project/path with the path to your scrapy project (where your scrapy.cfg is)

something with the spider name (use scrapy list in your project to find out; see the example after this list)

~/crawl.log with your log file location (in case you want logging)
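For example, finding the spider name with scrapy list might look like this (a sketch; "something" stands in for your real spider name):

    $ cd ~/project/path
    $ scrapy list
    something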

Upvotes: 6

NFern

Reputation: 2026

Adding the following lines in crontab -e runs my scrapy crawl at 5 AM every day. This is a slightly modified version of croc's answer:

PATH=/usr/bin
0 5 * * * cd project_folder/project_name/ && scrapy crawl spider_name

Without setting PATH, cron gave me the error "command not found: scrapy". I guess this is because /usr/bin is where the executables for programs are stored in Ubuntu.

Note that the complete path for my scrapy project is /home/user/project_folder/project_name. I ran the env command in cron and noticed that the working directory is /home/user. Hence I skipped /home/user in my crontab above.
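To repeat that env check yourself, a temporary crontab entry can dump cron's environment to a file (a sketch; /tmp/cron-env.txt is an arbitrary path, remove the line once done):

    # dump cron's environment once a minute for comparison with your login shell
    * * * * * env > /tmp/cron-env.txt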

The cron log can be helpful while debugging:

grep CRON /var/log/syslog

Upvotes: 13

croc

Reputation: 155

Another option is to forget using a shell script and chain the two commands together directly in the cronjob. Just make sure the PATH variable is set before the first scrapy cronjob in the crontab list. Run:

    crontab -e 

to edit and have a look. I have several scrapy crawlers which run at various times. Some every 5 mins, others twice a day.

    PATH=/usr/local/bin
    */5 * * * * cd /myfolder/crawlers/ && scrapy crawl my_spider_name_1
    0 1,13 * * * cd /myfolder/crawlers/ && scrapy crawl my_spider_name_2

All jobs located after the PATH variable will find scrapy. Here the first one runs every 5 minutes and the second twice a day, at 1am and 1pm. I found this easier to manage. If you have other binaries to run, you may need to add their locations to the path.
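For example, to expose both /usr/local/bin and the standard system directories to every job, the PATH line could be written as follows (the directories here are the common defaults; adjust to your system):

    PATH=/usr/local/bin:/usr/bin:/bin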

Upvotes: 3

Beka Tomashvili

Reputation: 2311

I solved this problem by including PATH in the bash file:

#!/bin/bash

cd /myfolder/crawlers/
PATH=$PATH:/usr/local/bin
export PATH
scrapy crawl my_spider_name

Upvotes: 34

KeepCalmAndCarryOn

Reputation: 9075

Does your shell script have execute permission?

E.g., can you run

  /myfolder/crawlers/getdata.sh 

without the sh?

If you can, then you can drop the sh from the cron line.
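A quick way to check and, if needed, fix the permission (using the path from the question):

    ls -l /myfolder/crawlers/getdata.sh     # look for an 'x' in the permission bits
    chmod +x /myfolder/crawlers/getdata.sh  # add execute permission if it's missing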

Upvotes: 0
