Kamikaze_goldfish

Reputation: 861

Python Scrapy not outputting to csv file

What am I doing wrong with this script that it isn't outputting a CSV file with the data? I am running it with scrapy runspider yellowpages.py -o items.csv, and nothing comes out but a blank CSV file. I have followed various answers here and watched YouTube videos trying to find my mistake, but I still can't figure out what I'm doing wrong.

# -*- coding: utf-8 -*-
import scrapy
import requests

search = "Plumbers"
location = "Hammond, LA"
url = "https://www.yellowpages.com/search"
q = {'search_terms': search, 'geo_location_terms': location}
page = requests.get(url, params=q)
page = page.url
items = ()


class YellowpagesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['yellowpages.com']
    start_urls = [page]

    def parse(self, response):
        self.log("I just visited: " + response.url)
        items = response.css('a[class=business-name]::attr(href)')
        for item in items:
            print(item)

Upvotes: 0

Views: 807

Answers (3)

Yash Pokar

Reputation: 5491

for item in items:
    print(item)

Put yield instead of print there:

for item in items:
    yield item
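
Note that Scrapy's feed exporter expects items such as dicts, not raw selectors, so extract the string first. A minimal sketch (the field name link is my assumption; use extract() instead of get() on older Scrapy versions):

for item in items:
    # get() turns the selector into its href string;
    # a dict gives the CSV exporter a named column.
    yield {'link': item.get()}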

Upvotes: 1

soldy

Reputation: 366

A simple spider, without a project.

Use my code; I wrote comments to make it easier to understand. This spider collects every result block from every page for a given pair of "service" and "location" parameters. In your case, run it with:

scrapy runspider yellowpages.py -a service="Plumbers" -a location="Hammond, LA" -o Hammondsplumbers.csv

The code will also work with any other query. For example:

scrapy runspider yellowpages.py -a service="Doctors" -a location="California, MD" -o MDDoctors.json

etc...

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.exceptions import CloseSpider


class YellowpagesSpider(scrapy.Spider):
    name = 'yellowpages'
    allowed_domains = ['yellowpages.com']
    start_urls = ['https://www.yellowpages.com/']

    # We can use any service + location pair in our request
    def __init__(self, service=None, location=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.service = service
        self.location = location

    def parse(self, response):
        # If both "service" and "location" are defined
        if self.service and self.location:
            # Create the search phrase from "service" and "location"
            search_url = 'search?search_terms={}&geo_location_terms={}'.format(self.service, self.location)
            # Send a request to "yellowpages.com" + "search_url", then call parse_result
            yield Request(url=response.urljoin(search_url), callback=self.parse_result)
        else:
            # Otherwise close the spider
            # You can add default values here if you want.
            self.logger.warning('=== Please use keys -a service="service_name" -a location="location" ===')
            raise CloseSpider()

    def parse_result(self, response):
        # All result blocks, excluding ad posts
        posts = response.xpath('//div[@class="search-results organic"]//div[@class="v-card"]')
        for post in posts:
            yield {
                'title': post.xpath('.//span[@itemprop="name"]/text()').extract_first(),
                'url': response.urljoin(post.xpath('.//a[@class="business-name"]/@href').extract_first()),
            }

        next_page = response.xpath('//a[@class="next ajax-page"]/@href').extract_first()
        # If we have next page url
        if next_page:
            # Send request with url "yellowpages.com" + "next_page", then call parse_result
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse_result)

Upvotes: 3

Woody1193

Reputation: 8010

On inspection of your code, I notice a number of problems:

First, you initialize items as a tuple when it should be a list: items = [].

You should change your name property to reflect the name you want for your crawler, so you can run it with scrapy crawl my_crawler where name = "my_crawler".

start_urls is supposed to contain strings, not Request objects. You should change the entry from page to the exact search URL you want to use. If you have a number of search strings and want to iterate over them, I would suggest using a middleware.
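
As a sketch of the string-only version of that entry (query values mirror the question; URL-encoding and the crawler name are my assumptions):

import scrapy

class YellowpagesSpider(scrapy.Spider):
    name = 'my_crawler'
    allowed_domains = ['yellowpages.com']
    # plain URL strings only, no Request objects
    start_urls = ['https://www.yellowpages.com/search?search_terms=Plumbers&geo_location_terms=Hammond%2C+LA']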

When you try to extract the data from the CSS selector, you're forgetting to call getall() (or the older extract()), which would actually turn your selectors into string data you can use.
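
The difference, as a sketch (getall() is the newer name; use extract() on older Scrapy versions):

# a SelectorList -- not yet usable as CSV data
links = response.css('a.business-name::attr(href)')
# a list of plain strings, ready to put into items
links = response.css('a.business-name::attr(href)').getall()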

Also, you shouldn't print to the standard output stream, because a lot of logging goes there too and it will make your output really messy. Instead, you should extract the responses into items, for example with item loaders.
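
A sketch of that loader approach (the item class, field names, and selectors are my assumptions, not from the question):

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst  # itemloaders.processors in newer Scrapy

class BusinessItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()

class BusinessLoader(ItemLoader):
    # each field extracts as a list; keep only the first string
    default_output_processor = TakeFirst()

# inside the spider:
def parse(self, response):
    for post in response.css('div.v-card'):
        loader = BusinessLoader(item=BusinessItem(), selector=post)
        loader.add_css('name', 'a.business-name::text')
        loader.add_css('url', 'a.business-name::attr(href)')
        yield loader.load_item()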

Finally, you're probably missing the appropriate settings in your settings.py file. The relevant options are covered in Scrapy's feed exports documentation.

FEED_FORMAT = "csv"
FEED_EXPORT_FIELDS = ["Field 1", "Field 2", "Field 3"]

Upvotes: 0
