I want to scrape the content of a website in Dagster using Scrapy. Unfortunately, all the examples I have found use Dagster's pipeline module, and the current version no longer has it.
I have this spider, whose parse function returns all the headings of the document. These headings are to be used in an asset. How do I connect the asset and the crawler?
import scrapy
from dagster import asset, AssetExecutionContext


class MySpider(scrapy.Spider):
    name = 'headless'

    def start_requests(self):
        urls = ['http://google.com']  # Enter the URL of the HTML page here
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        headlines = response.css('h1::text').getall()
        yield {'headlines': headlines}


spider = MySpider()


@asset()
def headlines(context: AssetExecutionContext):
    # Does not work: parse() needs a response, which only a running crawl provides
    headlines = spider.parse()
This is just a non-working example that I need some advice on.