ckz

Reputation: 5

How to assign the url that's being scraped from to an item?

I'm pretty new to Python and Scrapy and this site has been an invaluable resource so far for my project, but now I'm stuck on a problem that seems like it'd be pretty simple. I'm probably thinking about it the wrong way. What I want to do is add a column to my output CSV that lists the URL that each row's data was scraped from. In other words, I want the table to look like this:

item1    item2    item_url
a        1        http://url/a
b        2        http://url/a
c        3        http://url/b
d        4        http://url/b    

I'm using psycopg2 to get a bunch of URLs stored in a database, which I then scrape. The code looks like this:

class MySpider(CrawlSpider):
    name = "spider"

    # querying the database here...

    #getting the urls from the database and assigning them to the rows list
    rows = cur.fetchall()

    allowed_domains = ["www.domain.com"]

    start_urls = []

    for row in rows:

        #adding the urls from rows to start_urls
        start_urls.append(row)

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.select("a bunch of xpaths here...")
            items = []
            for site in sites:
                item = SettingsItem()
                # a bunch of items and their xpaths...
                # here is my non-working code
                item['url_item'] = row
                items.append(item)
            return items

As you can see, I wanted to make an item that just takes the url that the parse function is currently on. But when I run the spider, it gives me "exceptions.NameError: global name 'row' is not defined." I think that this is because Python doesn't recognize row as a variable within the XPathSelector function, or something like that? (Like I said, I'm new.) Anyway, I'm stuck, and any help would be much appreciated.

Upvotes: 0

Views: 591

Answers (1)

warvariuc

Reputation: 59664

Generate the start requests in start_requests() instead of in the class body:

class MySpider(CrawlSpider):

    name = "spider"
    allowed_domains = ["www.domain.com"]

    def start_requests(self):
        # querying the database here...

        #getting the urls from the database and assigning them to the rows list
        rows = cur.fetchall()

        for url, ... in rows:
            yield self.make_requests_from_url(url)


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("a bunch of xpaths here...")

        for site in sites:
            item = SettingsItem()
            # a bunch of items and their xpaths...
            # the URL this response was scraped from:
            item['url_item'] = response.url

            yield item

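The key point is that response.url already carries the URL of the page a callback is handling, so every item parsed in that callback can be tagged with it. The pairing logic can be sketched in plain Python without Scrapy (the pages dict and field names below are made up for illustration; in the spider, the dict key plays the role of response.url):

```python
def tag_items_with_source(pages):
    """Yield one dict per scraped value, tagged with the URL it came from.

    `pages` maps a source URL to the values scraped from that page --
    a stand-in for what parse() sees via `response`.
    """
    for url, values in pages.items():
        for value in values:
            yield {"item1": value, "item_url": url}

# Mirrors the table from the question:
rows = list(tag_items_with_source({
    "http://url/a": ["a", "b"],
    "http://url/b": ["c", "d"],
}))
```

Each yielded dict pairs a scraped value with its source URL, which is exactly the extra CSV column the question asks for.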
Upvotes: 2
