pekasus

Reputation: 646

Link Harvesting in Scrapy

I am both amazed and really frustrated with Scrapy. It seems like there is too much power under the hood, making for a really steep learning curve. Apparently, Scrapy can do everything that I used to program myself, but the problem is figuring out how to make it do what I want.

For now, I am writing a simple link harvester. I want to export two files: one with internal links and link text, and another with external links and link text.

I have been trying to use the -o file.csv command, but it lumps all of each page's URLs into a single cell as a list, and it includes duplicates.

The alternative that I have would be to just write my own code in 'parse': manually keep a list of links, check whether each link is already in the list before adding it, and then parse the URL to see whether the domain is internal or external (roughly the untested sketch below).
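
Something like this is what I have in mind (untested; harvested_links and example.com are just placeholders):

    from urlparse import urlparse  # Python 2; use urllib.parse on Python 3

    def is_internal(url, domain):
        # Relative URLs and URLs on the crawled domain count as internal.
        netloc = urlparse(url).netloc
        return netloc == '' or netloc.endswith(domain)

    seen = set()
    internal, external = [], []
    for url, text in harvested_links:  # placeholder list of (url, text) pairs
        if url in seen:
            continue
        seen.add(url)
        if is_internal(url, 'example.com'):
            internal.append((url, text))
        else:
            external.append((url, text))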

It seems like Scrapy should do this with a few commands. Is there a built-in method for this?

Here's the code that I am working with. I commented out the title part because I think I need to make another item object for those. I've abandoned that part for now.

    def parse_items(self, response):
        item = WebconnectItem()
        sel = Selector(response)
        items = []
        # item["title"] = sel.xpath('//title/text()').extract()
        # item["current_url"] = response.url
        item["link_url"] = sel.xpath('//a/@href').extract()
        item["link_text"] = sel.xpath('//a/text()').extract()
        items.append(item)
        return items

Upvotes: 1

Views: 489

Answers (3)

Slater Victoroff

Reputation: 21914

So, your thoughts about Scrapy are largely accurate: very powerful, steep learning curve, but it has a lot of promise if you can get past that part. There are even a few value-added services on top, like ScrapingHub, that can take care of rotating IP addresses, keeping jobs running, etc.

The difference is that Scrapy works using item pipelines rather than the traditional model: any processing of results should happen in these item pipelines rather than in the spider itself. The item pipeline docs have code that resolves exactly the issue you're having. Here is an example pipeline for removing duplicates and one for writing to JSON:

import json

from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item

class JsonWriterPipeline(object):

    def __init__(self):
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
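
To actually use these, you'd also enable them in your project's settings.py. A minimal sketch, assuming your project module is called webconnect (guessed from your WebconnectItem; adjust the path to wherever the pipeline classes live). The numbers only control the order the pipelines run in:

# settings.py
ITEM_PIPELINES = {
    'webconnect.pipelines.DuplicatesPipeline': 100,
    'webconnect.pipelines.JsonWriterPipeline': 200,
}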

Upvotes: 1

paul trmbrth

Reputation: 20748

Scrapy has extensive documentation and the tutorial is a good introduction.

It's built on top of Twisted, so you have to think in terms of asynchronous requests and responses, which is quite different from what you usually do with python-requests and BS4. python-requests blocks your thread when issuing HTTP requests; Scrapy does not, it lets you process responses while other requests may be over the wire.

You can use BS4 in scrapy callbacks (e.g. in your parse_items method).

You're right that Scrapy will output one item per line. It will not deduplicate URLs, because items are just items to Scrapy; they happen to contain URLs in your case, but Scrapy does no deduplication of items based on what they contain. You'd have to instruct it to do so, with an item pipeline for example (see the sketch below).
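
For example, a rough sketch of such a pipeline that drops duplicate URLs and writes internal and external links to two CSV files (not from the docs; it assumes link_url holds a single URL per item, as in the parse sketch further down, and 'example.com' stands in for the domain you crawl):

import csv
from urlparse import urlparse  # Python 2; urllib.parse on Python 3

from scrapy.exceptions import DropItem


class SplitLinksPipeline(object):

    def __init__(self):
        self.seen = set()
        self.internal = csv.writer(open('internal_links.csv', 'wb'))
        self.external = csv.writer(open('external_links.csv', 'wb'))

    def process_item(self, item, spider):
        url = item['link_url']
        if url in self.seen:
            raise DropItem("Duplicate link: %s" % url)
        self.seen.add(url)
        netloc = urlparse(url).netloc
        # Relative links and links on the crawled domain count as internal.
        writer = self.internal if not netloc or netloc.endswith('example.com') else self.external
        writer.writerow([url.encode('utf-8'),
                         (item['link_text'] or u'').encode('utf-8')])
        return item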

As for the URLs being represented as lists in your link_url and link_text fields, that's because sel.xpath('//a/@href').extract() returns a list of all matches.

Scrapy 1.0 (soon to be released) adds an .extract_first() method that would help in your case.
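
For instance, a callback that yields one item per anchor instead of one list-valued item per page could look roughly like this (a sketch assuming Scrapy 1.0's .extract_first(); on older versions you'd index into .extract() instead):

# inside your spider, where WebconnectItem is already imported
def parse_items(self, response):
    # One WebconnectItem per <a>, with scalar fields instead of lists.
    for anchor in response.xpath('//a'):
        item = WebconnectItem()
        item['link_url'] = anchor.xpath('@href').extract_first()
        item['link_text'] = anchor.xpath('text()').extract_first()
        if item['link_url']:  # skip anchors without an href
            yield item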

Upvotes: 1

taesu

Reputation: 4570

Implementation using requests & bs4

Note that it's not the optimal solution, but it shows how it can be done using requests and bs4, which is the go-to setup for me.

import requests
from bs4 import BeautifulSoup
URL = "http://www.cnn.com/"

# get request
r = requests.get(URL) 
# turn into bs instance
soup = BeautifulSoup(r.text) 
# get all links that actually have an href
links = soup.findAll('a', href=True)

internal_unique = []
external_unique = []
internal_links = []
external_links = []

for link in links:
    if 'cnn.com' in link['href'] or link['href'].startswith('/'):
        if link['href'] not in internal_unique:
            internal_links.append({'link':link['href'],'text':link.get_text()})
            internal_unique.append(link['href'])
    else:
        if link['href'] not in external_unique:
            external_links.append({'link':link['href'],'text':link.get_text()})
            external_unique.append(link['href'])
print internal_links
print external_links
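
If you want the two CSV files from the question instead of printed lists, you could dump them with csv.DictWriter; a quick sketch in the same Python 2 style (file names are just examples):

import csv

def write_links(path, rows):
    # rows are the {'link': ..., 'text': ...} dicts built above
    with open(path, 'wb') as f:  # 'wb' because Python 2's csv module writes bytes
        writer = csv.DictWriter(f, fieldnames=['link', 'text'])
        writer.writeheader()
        for row in rows:
            writer.writerow({'link': row['link'].encode('utf-8'),
                             'text': row['text'].encode('utf-8')})

write_links('internal_links.csv', internal_links)
write_links('external_links.csv', external_links)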

Upvotes: 0
