Reputation: 555
So I am trying to build a very basic scraper that pulls information from my server, uses that information to build a link, yields a request for that link, and after parsing the response grabs a single link from the parsed page and uploads it back to the server with a GET request. The problem I am encountering is that it will pull info from the server, create the link, and yield the request, but depending on the response time (which is inconsistent) it will bail out and start over with another GET request to the server. My server logic is designed so that it serves the next data set that needs to be worked on, and until a course of action is decided for that data set, the scraper keeps pulling and parsing the same one.

I am fairly new to Scrapy and in need of assistance. I know that my code is wrong, but I haven't been able to come up with another approach without changing a lot of server code and creating unnecessary hassle, and I am not super savvy with Scrapy or Python, unfortunately.

My start_requests method:
name = "scrapelevelone"
start_urls = []
def start_requests(self):
print("Start Requests is initiatied")
while True:
print("Were looping")
r = requests.get('serverlink.com')
print("Sent request")
pprint(r.text)
print("This is the request response text")
print("Now try to create json object: ")
try:
personObject = json.loads(r.text)
print("Made json object: ")
pprint(personObject)
info = "streetaddress=" + '+'.join(personObject['address1'].split(" ")) + "&citystatezip=" + '+'.join(personObject['city'].split(" ")) + ",%20" + personObject['state'] + "%20" + personObject['postalcodeextended']
nextPage = "https://www.webpage.com/?" + info
print("Creating info")
newRequest = scrapy.Request(nextPage, self.parse)
newRequest.meta['item'] = personObject
print("Yielding request")
yield newRequest
except Exception:
print("Reach JSON exception")
time.sleep(10)
Every time the parse function gets called it does all its logic and ends with a requests.get call that is supposed to send data back to the server, and it all works as intended if it gets that far. I have tried a lot of different things to get the scraper to loop and constantly ask the server for more information. I want the scraper to run indefinitely, but that defeats the purpose when I can't step away from the computer because it chokes on a request. Any recommendations for keeping the scraper running 24/7 without the ugly while loop in start_requests? And on top of that, can anyone tell me why it gets stuck in a loop of requests? :( I have a huge headache trying to troubleshoot this and finally gave in to a forum...
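For context, my parse callback is shaped roughly like this (a trimmed-down sketch: the selector, the update endpoint, and the parameter names are placeholders, not my real values):

def parse(self, response):
    person = response.meta['item']  # the data set pulled from the server
    # grab the single link I need from the parsed page (placeholder selector)
    found_link = response.css('a.result::attr(href)').get()
    # ...all the other logic...
    # at the end, send the result back to the server with a blocking GET
    requests.get('http://serverlink.com/update', params={
        'id': person.get('id'),   # placeholder field
        'link': found_link,
    })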
Upvotes: 0
Views: 915
Reputation: 21446
What you should do is start with your server URL and keep retrying it constantly by yielding Request objects. If the data you get back is new, parse it and schedule your requests:
import json

import scrapy
from scrapy import Request


class MyCrawler(scrapy.Spider):
    name = 'mycrawler'
    start_urls = ['http://myserver.com']
    past_data = None

    def parse(self, response):
        data = json.loads(response.text)
        if data == self.past_data:  # if data is the same, retry
            # time.sleep(10)  # you could add a delay here, but sleep would block everything
            yield Request(response.url, dont_filter=True, priority=-100)
            return
        self.past_data = data
        for url in data['urls']:
            yield Request(url, self.parse_url)
        # keep retrying
        yield Request(response.url, dont_filter=True, priority=-100)

    def parse_url(self, response):
        # ...
        yield {'scrapy': 'item'}
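If you need to slow the polling down, don't sleep; let Scrapy space the requests out for you. A per-spider DOWNLOAD_DELAY keeps the delay non-blocking (the 10 seconds below is just an example value):

class MyCrawler(scrapy.Spider):
    # ... same spider as above ...

    # Scrapy waits this long between requests to the same domain,
    # without blocking the rest of the crawl.
    custom_settings = {
        'DOWNLOAD_DELAY': 10,
    }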
Upvotes: 2