Anthony

Reputation: 111

Leveraging Beautifulsoup within Scrapy

I have created a simple crawler with Scrapy that starts at a given link and follows all links within a given DEPTH_LIMIT, which I adjust each time I run the spider to suit the project parameters. For simplicity, the script just prints the response URLs.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class Nonprof(CrawlSpider):
    name = "my_scraper"
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["https://stackoverflow.com"]

    rules = [
        Rule(LinkExtractor(
            allow=['.*']),
             callback='parse_item',
             follow=True)
        ]

    def parse_item(self, response):
        print(response.url)

My current objective is to parse all visible text within a given depth of the starting URL and use that data for topic modelling. I have done something similar in the past using BeautifulSoup, and I would like to leverage the following parsing logic within my crawler.

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request

def tag_visible(element):
    # Drop text nodes inside non-rendered tags and HTML comments.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    elif isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'lxml')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen('https://stackoverflow.com').read()
print(text_from_html(html))

My difficulty in integrating the two lies in the logic connecting a BeautifulSoup object to the Response object from Scrapy.

Any input is appreciated.

Upvotes: 2

Views: 6066

Answers (1)

alecxe

Reputation: 474161

At the very least, you can just pass the HTML source contained in response.body directly to BeautifulSoup to parse:

soup = BeautifulSoup(response.body, "lxml") 

Note, though, that while this would work and you could use soup to parse the desired data out of the HTML, you would be bypassing a large part of Scrapy: its selectors, their use in Item Loaders, and so on. If I were you, I'd just make myself comfortable with Scrapy's own way of extracting data out of HTML.
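To make the integration concrete, here is a minimal sketch of how the visible-text helper from the question could be wired into the spider's callback. It assumes BeautifulSoup 4 is installed; it uses the built-in html.parser (swap in 'lxml' if you have it) and a helper name, visible_text, that I made up for illustration:

from bs4 import BeautifulSoup
from bs4.element import Comment

def tag_visible(element):
    # Skip text nodes inside non-rendered containers, and skip HTML comments.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def visible_text(html):
    # html may be bytes (as response.body is) or str; BeautifulSoup accepts both.
    soup = BeautifulSoup(html, 'html.parser')
    texts = soup.find_all(string=True)
    return u" ".join(t.strip() for t in filter(tag_visible, texts) if t.strip())

# Inside the spider, the callback would then become:
# def parse_item(self, response):
#     yield {'url': response.url, 'text': visible_text(response.body)}

Yielding a dict (or an Item) rather than printing lets Scrapy's feed exports collect the text for the topic-modelling step.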

Upvotes: 5
