MaxxABillion

Reputation: 591

Scrapy-based crawler not extracting content within <p> tags

I have a custom crawler that scrapes news articles. For the most part it works; however, when adding new URLs, it's sometimes hard to figure out which CSS selectors to use to get the content I want. Below is the code I'm working on.

# -*- coding: utf-8 -*-
""" Script to crawl articles from https://mycbs4.com
"""
try:
    from crawler import BaseCrawler
except ImportError:
    from __init__ import BaseCrawler


class Cmycbs4Crawler(BaseCrawler):
    start_urls = [
        'https://mycbs4.com/search?find=cannabis',
        'https://mycbs4.com/search?find=marijuana',
        'https://mycbs4.com/search?find=cbd',
        'https://mycbs4.com/search?find=thc',
        'https://mycbs4.com/search?find=hemp'
    ]

    source_id = 'mycbs4'

    config_selectors = {
        # CSS selector on the articles page (the page that lists many articles)
        'POST_URLS': '.sd-main a::attr(href)',
        #'NEXT_PAGE_URL': '.pager-next > a::attr(href)', # default

        # CSS selector on the article's detail page (the page that displays the full content of the article)
        'ARTICLE_CONTENT': '#js-Story-Content-0 > p',
    }

if __name__ == "__main__":
    crawler = Cmycbs4Crawler()
    crawler.run()

The crawler should crawl the URLs and store everything it finds in a DB. It scrapes everything except the article content.

I've tried the following selectors:

'#js-Story-Content-0 > p'
'.StoryText_storyText__1uZ3 > p'
'#js-Story-Content-0 .StoryText_storyText__1uZ3 > p'

None of them returns any content from the article, so I'm not sure what I'm doing wrong.

Below is a screenshot of the content/p tags I'm trying to scrape:

enter image description here

Any help would be greatly appreciated.

Upvotes: 0

Views: 45

Answers (1)

mdaniel

Reputation: 33203

Your content lives in <script data-prerender="facade" type="application/json">, which is great: you don't have to go spelunking around in the HTML to find the information you want. You can parse that element's text with json.loads instead.
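
A minimal sketch of that approach, using only the standard library so it runs outside Scrapy. The sample HTML and the JSON keys ("story", "paragraphs") are hypothetical; inspect the real payload to find where the article text actually lives. Inside a spider, response.css('script[data-prerender="facade"]::text').get() should give you the same raw string to pass to json.loads.

```python
import json
from html.parser import HTMLParser


class PrerenderJSONParser(HTMLParser):
    """Collect the text inside <script data-prerender="facade">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "script" and dict(attrs).get("data-prerender") == "facade":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_prerender_json(html):
    """Return the decoded JSON payload, or None if the script tag is absent."""
    parser = PrerenderJSONParser()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks).strip()
    return json.loads(raw) if raw else None


# Hypothetical page source: the real site's JSON layout will differ.
SAMPLE_HTML = """
<html><body>
<div id="root"></div>
<script data-prerender="facade" type="application/json">
{"story": {"paragraphs": ["First paragraph.", "Second paragraph."]}}
</script>
</body></html>
"""

data = extract_prerender_json(SAMPLE_HTML)
article_text = "\n".join(data["story"]["paragraphs"])
print(article_text)
```

Once you know which keys hold the body text, the crawler can build the article content from the parsed dictionary rather than from a CSS selector.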

BTW, it's a dead giveaway when you see an id like js-Story-Content-0 and cannot find any of those <blockquote> elements in the page source. The page source is not the same as the page DOM, and Scrapy only ever sees the page source, never the DOM that JavaScript builds afterwards.
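
One way to check this for yourself is to test whether the id you are selecting on exists in the raw source at all. The sketch below is only an illustration: the two HTML strings are made-up stand-ins for "what the server sends" and "what the browser shows after JavaScript runs", and id_in_page_source is a hypothetical helper, not part of Scrapy.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Record every id attribute found on a start tag."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def id_in_page_source(html, element_id):
    """Return True if an element with this id exists in the given HTML."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return element_id in collector.ids


# Made-up example of what the server might send (this is all Scrapy sees).
RAW_SOURCE = '<html><body><div id="root"></div><script>/* app code */</script></body></html>'

# Made-up example of the DOM after JavaScript has run in a browser.
RENDERED_DOM = (
    '<html><body><div id="root">'
    '<div id="js-Story-Content-0"><p>Article text</p></div>'
    '</div></body></html>'
)

in_source = id_in_page_source(RAW_SOURCE, "js-Story-Content-0")
in_dom = id_in_page_source(RENDERED_DOM, "js-Story-Content-0")
print(in_source, in_dom)
```

If the id is missing from the source the server returns, no CSS selector built on it will ever match in Scrapy, however correct it looks in the browser's inspector.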

Upvotes: 1
