kekw

Reputation: 390

Python3 Scrapy Webcrawler

For my work, I have to write a crawler that saves only the title of the page, the delivery status, and the available quantity of the product.

Here is my default spider code:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
        'https://www.topart-online.com/de/Ahorn-japan.%2C-70cm%2C--36-Blaetter----Herbst/c-KAT282/a-150001HE'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-1]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

I need an output file which contains only these fields:

title of the product, available quantity, and delivery status

I don't know how to edit the code so that the values are written to a new file; I only know how to save the whole page as a new .html file.

Thanks for your help, guys.

Upvotes: 0

Views: 72

Answers (1)

AaronS

Reputation: 2335

Essentially, Scrapy extracts data based on selectors. Here we're using XPath selectors, but you could use CSS selectors if you wish. Please see here for an introduction.

Here is a bit more information in the Scrapy docs on extracting data.

What we're doing is yielding a dictionary built from the response Scrapy gets when it fetches the HTML. Each yielded dictionary becomes a row in the output, with the keys as the field names and the values as the extracted data.

Code Example

def parse(self, response):
    # Each selector grabs one field; get() returns the first match as a string
    yield {
        'title': response.xpath('//h1[@class="text-center text-md-left mt-0"]/text()').get(),
        'product': response.xpath('//div[@class="col-6"]/text()')[0].get().strip(),
        'delivery_status': response.xpath('//div[@class="availabilitydeliverytime"]/text()').get().replace('/', '').strip()
    }

Explanation

A yield statement returns values lazily. It's related to return but behaves quite differently; I suggest you look here for further details on the distinction.
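To make the distinction concrete, here is a small plain-Python sketch (the values are made up for illustration): a function with yield becomes a generator that produces one value at a time, instead of building the whole result up front.

```python
def return_all():
    # return hands back the complete list in one go
    return [1, 2, 3]

def yield_each():
    # yield produces one value at a time; calling this
    # function returns a generator, not a list
    for n in [1, 2, 3]:
        yield n

print(return_all())        # [1, 2, 3]
print(list(yield_each()))  # [1, 2, 3], but each value was produced lazily
```

Scrapy relies on this: parse() can yield many items (or further requests) one by one, and Scrapy consumes them as they are produced.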

The response.xpath() method accepts an XPath selector and grabs the matching data. get() returns only the first result; if multiple HTML tags match the selector, getall() can be used to grab all of the results.

  1. // - searches the entire HTML document
  2. h1 - the tag we want to get data from
  3. [@class="..."] - selects only h1 tags with that class attribute
  4. /text() - grabs the text within the HTML tag
  5. get() - returns the first matching result
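The first-match vs. all-matches behaviour can be mimicked with the standard library's xml.etree.ElementTree, which supports a limited XPath subset (this is only a rough analogue of Scrapy's get()/getall(), and the HTML snippet here is made up):

```python
import xml.etree.ElementTree as ET

html = """
<html><body>
  <div class="col-6">first</div>
  <div class="col-6">second</div>
</body></html>
"""

root = ET.fromstring(html)

# get() analogue: find() returns only the first matching element
first = root.find(".//div[@class='col-6']").text

# getall() analogue: findall() returns every matching element
all_texts = [div.text for div in root.findall(".//div[@class='col-6']")]

print(first)      # first
print(all_texts)  # ['first', 'second']
```

In a real spider you would stay with response.xpath(...).get() and .getall(), which additionally tolerate messy real-world HTML.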

For the product, multiple tags in the HTML had class="col-6", so we grabbed only the first one, since response.xpath() returns a list. We used the get() method and then strip() to remove any whitespace.

The delivery status is similar to the above, but we also used the replace() method to get rid of the /.

When you run the scrapy script, use scrapy crawl quotes -o quotes.json if you want it in JSON format. More information here in the docs

You should check out the Scrapy tutorial in the docs here. It will be extremely helpful in getting to grips with a basic scraper. Here we're yielding a dictionary based on XPath selectors.

Additional Information

For anything but the most structured data, I would suggest you look up Items and ItemLoaders for storing data. These are far more flexible when you run into data that needs to be cleaned up. Yielding a dictionary is the simplest way to get data out of Scrapy.

Upvotes: 1
