Reputation: 11
Code for items.py and the other files is shown below, and the logs are included at the end. I am not getting any errors, but according to the logs Scrapy has not scraped any pages.
```
import scrapy


class YelpItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    name_url = scrapy.Field()
    rating = scrapy.Field()
    date = scrapy.Field()
    review_text = scrapy.Field()
    user_pic = scrapy.Field()
    city = scrapy.Field()
    is_true = scrapy.Field()
```
code for settings.py
```
import pathlib

BOT_NAME = 'yelp-scrapy-dev'

SPIDER_MODULES = ['yelp-scrapy-dev.spiders']
NEWSPIDER_MODULE = 'yelp-scrapy-dev.spiders'

FEEDS = {
    pathlib.Path('output1.csv'): {
        'format': 'csv',
    },
}

ROBOTSTXT_OBEY = False
```
code for pipelines.py
```
class YelpPipeline:
    def open_spider(self, spider):
        self.file = open('output1.csv', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        return item
```
code for middlewares.py
```
from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class YelpSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class YelpDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
```
code for the city spider. The spider collects the reviews from the specified URLs.
```
import scrapy

from ..items import YelpItem


# currently will grab the first 100 reviews from the first 10 businesses from start url
class CitySpider(scrapy.Spider):
    name = 'city'
    start_urls = [
        'https://www.yelp.com/search?find_desc=&find_loc=Seattle%2C+WA',
        'https://www.yelp.com/search?find_desc=&find_loc=SanFrancisco%2C+CA',
        'https://www.yelp.com/search?find_desc=&find_loc=NewYork%2C+NY',
        'https://www.yelp.com/search?find_desc=&find_loc=Dallas%2C+TX',
        'https://www.yelp.com/search?find_desc=&find_loc=Atlanta%2C+GA',
    ]

    # gets the first 10 businesses from the start url
    def parse(self, response):
        business_pages = response.css('.text-weight--bold__373c0__1elNz a')
        yield from response.follow_all(business_pages, self.parse_business)

    # extracts the first 100 reviews from the yelp-scrapy-dev business
    def parse_business(self, response):
        items = YelpItem()
        all_reviews = response.css('.sidebarActionsHoverTarget__373c0__2kfhE')
        # derive the city from the business URL slug
        address = response.request.url.split('?')
        src = address[0].split('/')
        biz = src[-1].split('-')
        loc = biz[-1] if not biz[-1].isdigit() else biz[-2]
        if loc == 'seattle':
            city = 'Seattle, WA'
        elif loc == 'dallas':
            city = 'Dallas, TX'
        elif loc == 'francisco':
            city = 'San Francisco, CA'
        elif loc == 'york':
            city = 'New York, NY'
        elif loc == 'atlanta':
            city = 'Atlanta, GA'
        else:
            city = 'outofrange'
        for review in all_reviews:
            name = review.css('.link-size--inherit__373c0__1VFlE::text').extract_first()
            name_url = review.css('.link-size--inherit__373c0__1VFlE::attr(href)').extract_first().split('=')
            rating = review.css('.overflow--hidden__373c0__2y4YK::attr(aria-label)').extract()
            date = review.css('.arrange-unit-fill__373c0__3Sfw1 .text-color--mid__373c0__jCeOG::text').extract()
            review_text = review.css('.raw__373c0__3rKqk::text').extract()
            user_pic = review.css('.gutter-1__373c0__2l5bx .photo-box-img__373c0__35y5v::attr(src)').extract()
            if city != 'outofrange':
                # making sure data is stored as a str
                items['name'] = name
                items['name_url'] = name_url[1]
                items['rating'] = rating[0]
                items['date'] = date[0]
                items['review_text'] = review_text[0]
                items['user_pic'] = user_pic[0] != 'https://s3-media0.fl.yelpcdn.com/assets/srv0/yelp_styleguide/514f6997a318/assets/img/default_avatars/user_60_square.png'
                items['city'] = city
                items['is_true'] = True
                yield items
        source = response.request.url
        # prevent duplicate secondary pages from being recrawled
        if '?start=' not in source:
            # gets 20th-100th reviews, pages are every 20 reviews
            for i in range(1, 5):
                next_page = source + '?start=' + str(i * 20)
                yield response.follow(next_page, callback=self.parse_business)
```
And below are the log lines.
```
(venv) C:\Users\somar\yelp-scrapy\yelp>scrapy crawl city
2020-10-09 22:34:53 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: yelp-scrapy-dev)
2020-10-09 22:34:53 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10.0.18362-SP0
2020-10-09 22:34:53 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-09 22:34:53 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'yelp-scrapy-dev',
'NEWSPIDER_MODULE': 'yelp-scrapy-dev.spiders',
'SPIDER_MODULES': ['yelp-scrapy-dev.spiders']}
2020-10-09 22:34:53 [scrapy.extensions.telnet] INFO: Telnet Password: 1f95c571b9245c42
2020-10-09 22:34:53 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-10-09 22:34:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-09 22:34:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-09 22:34:54 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-10-09 22:34:54 [scrapy.core.engine] INFO: Spider opened
2020-10-09 22:34:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-09 22:34:54 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-09 22:34:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/search?find_desc=&find_loc=Dallas%2C+TX> (referer: None)
2020-10-09 22:34:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/search?find_desc=&find_loc=Atlanta%2C+GA> (referer: None)
2020-10-09 22:34:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/search?find_desc=&find_loc=Seattle%2C+WA> (referer: None)
2020-10-09 22:34:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/search?find_desc=&find_loc=NewYork%2C+NY> (referer: None)
2020-10-09 22:34:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/search?find_desc=&find_loc=SanFrancisco%2C+CA> (referer: None)
2020-10-09 22:34:56 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-09 22:34:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1264,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 5,
'downloader/response_bytes': 278234,
'downloader/response_count': 5,
'downloader/response_status_count/200': 5,
'elapsed_time_seconds': 2.159687,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 10, 10, 5, 34, 56, 173193),
'log_count/DEBUG': 5,
'log_count/INFO': 10,
'response_received_count': 5,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2020, 10, 10, 5, 34, 54, 13506)}
2020-10-09 22:34:56 [scrapy.core.engine] INFO: Spider closed (finished)
```
Upvotes: 1
Views: 300
Reputation: 86
Your CSS selectors are not correct, so they do not return the data you expect. For example, `business_pages` in the `parse` method ends up as an empty list, so no business pages are followed and the next pages' URLs are never parsed.
Make sure your CSS selectors are correct, and test them against the page BEFORE the browser runs any JavaScript (or disable JavaScript in your browser), so that you are looking at the same HTML that Scrapy actually receives.
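One way to do that is a quick `scrapy shell` session (a hypothetical transcript using the first start URL from the question; the hashed class name is the one from the spider and has likely changed on the live site):
```
scrapy shell "https://www.yelp.com/search?find_desc=&find_loc=Seattle%2C+WA"
>>> response.css('.text-weight--bold__373c0__1elNz a')
[]
>>> view(response)  # opens the HTML Scrapy downloaded in your browser
```
An empty SelectorList like the one above means the selector matches nothing in the downloaded page.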
Also, for simple debugging you can use `print()` or `self.logger` to print your current data, for example `self.logger.info(business_pages)` or `print(business_pages)`, and you will see the result in the logs.
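As a minimal sketch of that idea, dropped into the existing parse method (the selector is taken verbatim from the question):
```
def parse(self, response):
    business_pages = response.css('.text-weight--bold__373c0__1elNz a')
    # If this logs 0, the class name does not exist in the HTML Scrapy
    # received, and follow_all() below has nothing to follow.
    self.logger.info('Matched %d business links', len(business_pages))
    yield from response.follow_all(business_pages, self.parse_business)
```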
Upvotes: 0
Reputation: 1
This happens because Yelp's robots.txt file disallows web crawlers from accessing the specific URLs you were trying to scrape.
Like many websites, Yelp uses robots.txt to instruct web crawlers on which parts of their website are off-limits to indexing and scraping. When a website's robots.txt file disallows a particular URL or directory, web crawlers like Scrapy typically respect those rules and do not fetch the disallowed content.
In your case, it seems that Yelp has disallowed access to the pages you were attempting to scrape. You can check the robots.txt file at the following link:
https://www.yelp.com/robots.txt
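If you want to check this programmatically, here is a minimal sketch using only the Python standard library (the search URL is the Seattle one from the question); this is essentially the check Scrapy performs when ROBOTSTXT_OBEY is enabled:
```
from urllib import robotparser

# Download and parse Yelp's robots.txt, then ask whether a generic
# crawler ('*') is allowed to fetch one of the spider's start URLs.
rp = robotparser.RobotFileParser('https://www.yelp.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://www.yelp.com/search?find_desc=&find_loc=Seattle%2C+WA'))
```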
Upvotes: -1