Reputation: 591
I have a custom crawler that scrapes news articles. For the most part it works. However, when adding new URLs, it's sometimes hard to figure out which CSS selectors to use to get the content I want. Below is the code I'm working on.
# -*- coding: utf-8 -*-
""" Script to crawl articles from https://mycbs4.com
"""
try:
    from crawler import BaseCrawler
except ImportError:
    from __init__ import BaseCrawler


class Cmycbs4Crawler(BaseCrawler):
    start_urls = [
        'https://mycbs4.com/search?find=cannabis',
        'https://mycbs4.com/search?find=marijuana',
        'https://mycbs4.com/search?find=cbd',
        'https://mycbs4.com/search?find=thc',
        'https://mycbs4.com/search?find=hemp'
    ]
    source_id = 'mycbs4'
    config_selectors = {
        # CSS selector on the articles page (the page that lists many articles)
        'POST_URLS': '.sd-main a::attr(href)',
        #'NEXT_PAGE_URL': '.pager-next > a::attr(href)', # default
        # CSS selector on the article's detail page (the page that displays the full content of the article)
        'ARTICLE_CONTENT': '#js-Story-Content-0 > p',
    }


if __name__ == "__main__":
    crawler = Cmycbs4Crawler()
    crawler.run()
The crawler should crawl the URLs and populate everything back into a DB. It scrapes everything except the content.
I've tried the following selectors:
'#js-Story-Content-0 > p'
'.StoryText_storyText__1uZ3 > p'
'#js-Story-Content-0 .StoryText_storyText__1uZ3 > p'
None of them returns any content from the article, so I'm not sure what I'm doing wrong.
Below is a screenshot of the content/p tags I'm trying to scrape.
Any help would be greatly appreciated.
Upvotes: 0
Views: 45
Reputation: 33203
Your content lives in <script data-prerender="facade" type="application/json">, which is great because you don't have to go spelunking around in the HTML to parse the information you want; you can use json.loads instead.
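As a rough sketch of that approach, the snippet below pulls the text out of the matching <script> tag and hands it to json.loads. It uses only the standard library so it runs on its own. The sample HTML and the JSON keys in it (`story`, `headline`, `body`) are invented for illustration, so inspect the real payload to find where the article text actually sits. The names `JsonScriptExtractor` and `extract_prerender_json` are hypothetical helpers, not part of your BaseCrawler.

```python
import json
from html.parser import HTMLParser


class JsonScriptExtractor(HTMLParser):
    """Collect the text of <script> tags whose attributes match `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted      # attributes the <script> tag must carry
        self.payloads = []        # raw text of each matching <script>
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            attr_map = dict(attrs)
            if all(attr_map.get(k) == v for k, v in self.wanted.items()):
                self._capturing = True
                self.payloads.append("")

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so accumulate them.
        if self._capturing:
            self.payloads[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._capturing = False


def extract_prerender_json(html):
    """Return the parsed JSON from the first matching <script>, or None."""
    parser = JsonScriptExtractor(
        {"data-prerender": "facade", "type": "application/json"}
    )
    parser.feed(html)
    parser.close()
    return json.loads(parser.payloads[0]) if parser.payloads else None


# Hypothetical sample page: the JSON keys below are assumptions, not the
# real structure served by the site.
SAMPLE_HTML = """
<html><body>
<div id="js-Story-Content-0"></div>
<script data-prerender="facade" type="application/json">
{"story": {"headline": "Example headline", "body": "<p>First paragraph.</p><p>Second paragraph.</p>"}}
</script>
</body></html>
"""

data = extract_prerender_json(SAMPLE_HTML)
print(data["story"]["headline"])
```

Inside a Scrapy callback, the raw text would typically be available with `response.css('script[data-prerender="facade"]::text').get()`, which you can pass straight to json.loads. How that plugs into `config_selectors` depends on your BaseCrawler, which I can't see.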
BTW, it's a dead giveaway when you see a class name of js-Story-Content-0 and you cannot find any of those <blockquote> elements in the page source. The page source is not the same as the page DOM, and Scrapy only ever sees the page source, not the DOM.
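One quick way to check this before writing selectors is to download the raw HTML (what Scrapy receives) and see whether the markup you spotted in the browser inspector is there at all. The sketch below is only an illustration: `fetch_page_source` and `missing_from_source` are hypothetical helper names, the User-Agent value is an arbitrary assumption, and the sample string stands in for a real response so the example runs offline.

```python
import urllib.request


def fetch_page_source(url, timeout=10.0):
    """Download the raw HTML, i.e. the page source before any JavaScript runs."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def missing_from_source(html, markers):
    """Return the markers that do not appear anywhere in the raw HTML."""
    return [marker for marker in markers if marker not in html]


# Hypothetical stand-in for a downloaded page: the container exists, but the
# elements seen in the browser inspector were added later by JavaScript.
SAMPLE_SOURCE = (
    '<div id="js-Story-Content-0"></div>'
    '<script type="application/json">{}</script>'
)

missing = missing_from_source(
    SAMPLE_SOURCE,
    ['id="js-Story-Content-0"', "<blockquote", "StoryText_storyText"],
)
print(missing)  # ['<blockquote', 'StoryText_storyText']
```

Anything reported as missing exists only in the rendered DOM, so a CSS selector that targets it will match nothing in Scrapy.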
Upvotes: 1