XRemixX

Reputation: 11

Scrapy cannot find inside <div> tags

Good day. I'm currently writing a Scrapy program to scrape a news website. I'm a beginner with Scrapy, and I've hit a problem that keeps me from making progress with my code.

The website that I'm currently trying to scrape is https://www.thestar.com.my/news/nation

In the page's HTML, there's a div tag with class="row list-listing". I'm trying to get the paragraph tag inside that div, but Scrapy can't seem to find it.

I've checked for unclosed tags, but all of them seem to be closed. So why is Scrapy unable to fetch this tag? The innermost tag that Scrapy can fetch is div class="sub-section-list", which is outside the div class="row list-listing".

Also, when I fetch the div class="sub-section-list" tag, it only extracts this HTML:

<div class="sub-section-list">
    <div class="button-view btnLoadMore" style="margin: 10px auto 15px;">
        <a id="loadMorestories">Load more </a>
    </div>
</div>

When inspecting the website, these are the tags that I need:

[Screenshot: Website Tag]

I will include my basic code. I've only started the project so I haven't made any progress since this problem.

import scrapy


class WebCrawl(scrapy.Spider):
    name = "spooder"
    allowed_domains = ["thestar.com.my"]
    start_urls = ["https://www.thestar.com.my/news/nation"]

    def parse(self, response):
        text = response.xpath("//div[@class='sub-section-list']").extract()
        yield {
            'text' : text
        }

If I forgot to include anything necessary, please tell me. Any help would be much appreciated.

Upvotes: 1

Views: 1392

Answers (2)

tomjn

Reputation: 5389

As Wim says, the page is loaded dynamically, so there are a few options. Using the Firefox developer tools, it looks like the content is retrieved from:

https://cdn.thestar.com.my/Content/Data/parsely_data.json

So you could load the JSON directly and get what you want from there. Something like:

import scrapy
import json

class WebCrawl(scrapy.Spider):
    name = "spooder"
    allowed_domains = ["thestar.com.my"]
    start_urls = ["https://cdn.thestar.com.my/Content/Data/parsely_data.json"]

    def parse(self, response):
        yield from json.loads(response.text)['data']

Of course, this probably isn't exactly what you want, but perhaps it is a good start?

(Note that the above code is overkill for what it does, but if you are going to start scraping, you can build on it.)
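If you only want a few fields from each entry, a minimal sketch could look like the following. It uses only the standard library so it can be tried without running the spider. The sample payload and the field names `title` and `url` are my assumptions for illustration; the only thing taken from the real feed is the top-level `data` key, so check the actual JSON before relying on any field.

```python
import json

# Hypothetical sample standing in for response.text; the real feed's
# fields may differ. Only the top-level "data" key comes from the answer.
SAMPLE = (
    '{"data": ['
    '{"title": "First story", "url": "/news/nation/a"}, '
    '{"title": "Second story"}'
    ']}'
)


def iter_items(text, fields=("title", "url")):
    """Yield one dict per entry in the 'data' list, keeping only `fields`."""
    payload = json.loads(text)
    for entry in payload.get("data", []):
        # Missing keys become None instead of raising KeyError.
        yield {name: entry.get(name) for name in fields}


items = list(iter_items(SAMPLE))
```

Inside the spider, `parse` could then do `yield from iter_items(response.text)` instead of yielding the raw entries.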

Upvotes: 2

Wim Hermans

Reputation: 2116

The content is loaded dynamically, so you won't be able to use XPath like this without rendering the page. However, the article bodies seem to be present in the HTML inside a script tag, and you can get them as follows:

import json

# Grab the inline script that defines the `listing` variable.
script = response.xpath(
    "//script[contains(text(), 'var listing = ')]/text()"
).extract_first()

# Slice out the object literal; look for the closing '};' only after the marker.
first_index = script.index('var listing = ') + len('var listing = ')
last_index = script.index('};', first_index) + 1
listings = json.loads(script[first_index:last_index])
articles = [article['article_body'] for article in listings['data']]
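Searching for `};` can cut the object short if an article body happens to contain that sequence. A possible alternative, sketched below, lets `json.JSONDecoder.raw_decode` find the end of the object itself. The sample script text is made up for illustration, and the sketch assumes the value assigned to `listing` is valid JSON (double-quoted keys and strings), which may not hold for every page.

```python
import json

MARKER = "var listing = "


def extract_listing(script):
    """Parse the JSON object assigned to `listing` inside a script string."""
    # raw_decode does not skip leading whitespace, so start at the brace.
    start = script.index("{", script.index(MARKER) + len(MARKER))
    # raw_decode parses one JSON value and ignores whatever follows it.
    obj, _end = json.JSONDecoder().raw_decode(script, start)
    return obj


# Hypothetical script content; the second body contains '};' on purpose.
SAMPLE_SCRIPT = (
    'var other = {"x": 1};\n'
    'var listing = {"data": [{"article_body": "Body A"}, '
    '{"article_body": "Body B; with };"}]};\n'
    "render(listing);"
)

listing = extract_listing(SAMPLE_SCRIPT)
bodies = [article["article_body"] for article in listing["data"]]
```

In the spider you would pass the string returned by the XPath query to `extract_listing` instead of `SAMPLE_SCRIPT`.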

Upvotes: 1
