Connecting a Web Scraper to an Asset in Dagster without the Pipeline Module

Question

I want to scrape the content of a website in Dagster with Scrapy. Unfortunately, all the examples I have found use Dagster's pipeline module, and the current version does not have this module.

I have this spider, whose parse function returns all headings of the document. These headings are to be used in an asset. How do I connect the asset and the crawler?

    import scrapy
    from dagster import asset, AssetExecutionContext

    class MySpider(scrapy.Spider):
        name = 'headless'

        
        def start_requests(self):
            urls = ['http://google.com']  # Enter the URL of the HTML page here
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)
        
        def parse(self, response):
            headlines = response.css('h1::text').getall()  
            yield {'headlines': headlines}

    spider = MySpider()  

    @asset()
    def headlines(context: AssetExecutionContext):
        headlines = spider.parse()

This is just a non-working example that I need some advice on.

Answers (0)
