Reputation: 111
I have created a simple crawler with Scrapy that starts at a given link and follows all links within a given DEPTH_LIMIT, which I adjust each time I run the spider depending on the project parameters. For the sake of simplicity, the script just prints the response URLs.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from NONPROF.items import NonprofItem
from scrapy.http import Request
import re

class Nonprof(CrawlSpider):
    name = "my_scraper"
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["https://stackoverflow.com"]

    rules = [
        Rule(LinkExtractor(allow=['.*']),
             callback='parse_item',
             follow=True)
    ]

    def parse_item(self, response):
        print(response.url)
My current objective is to parse all visible text within a given depth from the starting URL and use that data for topic modelling. I have done something similar in the past using BeautifulSoup, but I would like to leverage the following parsing logic within my crawler.
import urllib.request
import bs4 as bs
from bs4 import BeautifulSoup

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    elif isinstance(element, bs.element.Comment):
        return False
    return True

def text_from_html(body):
    # note: parse the "body" parameter, not the module-level "html"
    soup = BeautifulSoup(body, 'lxml')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen('https://stackoverflow.com').read()
print(text_from_html(html))
My difficulty in integrating the two lies in the relationship between a BeautifulSoup object and the Response object from Scrapy.
Any input is appreciated.
Upvotes: 2
Views: 6066
Reputation: 474161
At the very least, you can just pass the HTML source contained in response.body directly to BeautifulSoup to parse:
soup = BeautifulSoup(response.body, "lxml")
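For instance, the question's tag_visible helper can be reused inside the spider's callback on response.body. This is a minimal sketch, not the asker's actual project code; it uses the stdlib 'html.parser' so the snippet carries no lxml dependency (swap in 'lxml' if it is installed):

```python
# Sketch: reusing the question's tag_visible filter on raw HTML such as
# response.body. 'html.parser' is used here instead of 'lxml' so the
# example needs no extra dependency.
from bs4 import BeautifulSoup
import bs4 as bs

def tag_visible(element):
    # Drop text nodes inside tags that are never rendered, and comments.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, bs.element.Comment):
        return False
    return True

def visible_text(html):
    # html may be bytes (response.body) or str (response.text)
    soup = BeautifulSoup(html, 'html.parser')
    texts = soup.find_all(text=True)
    return u" ".join(t.strip() for t in filter(tag_visible, texts) if t.strip())

# Inside the spider, the callback would then be:
# def parse_item(self, response):
#     print(visible_text(response.body))
```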
Note, though, that while this would work and you could use soup to parse the desired data out of the HTML, you would not be using a huge part of Scrapy: Scrapy selectors, the use of selectors in Item Loaders, etc. If I were you, I'd just make myself comfortable with the power of Scrapy's own way of extracting data out of HTML.
Upvotes: 5