SpaceDust__

Reputation: 4914

How to scrape and parse nested div with scrapy

I am trying to follow this GitHub spider in order to learn how to crawl nested divs on Facebook: https://github.com/talhashraf/major-scrapy-spiders/blob/master/mss/spiders/facebook_profile.py

In that file, parse_info_text_only and parse_info_has_image work fine for getting the span information.

I have a similar page where I am trying to get the result_id from a nested div; however, the result_id is stored on the div itself.


From what I understand, the div I am trying to scrape is in the 2nd row, so I tried something like:

def parse_profile(self, response):
    item["BrowseResultsContainer"] = self.parse_info_has_id(response.css('#BrowseResultsContainer'))
    return item

def parse_info_has_id(self, css_path):
    text = css_path.xpath('div/div').extract()
    text = [t.strip() for t in text]
    text = [t for t in text if re.search('result_id', t)]
    return "\n".join(text)

How can I get the data-xt from the nested div above?

Upvotes: 0

Views: 1164

Answers (2)

eLRuLL

Reputation: 18799

With CSS:

import json
...
    def parse_info_has_id(self, css_path):
        # ::attr() is CSS selector syntax, so it must go through .css(), not .xpath()
        text = css_path.css('div::attr(data-gt)').extract_first()
        d = json.loads(text)
        return d['result_id']
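
Note that extract_first() returns None when nothing matches, and json.loads(None) raises a TypeError. A minimal sketch of a guarded decode, assuming the attribute holds a JSON object with a result_id key as in the snippet above; the helper name result_id_from_attr is hypothetical:

```python
import json


def result_id_from_attr(attr_value):
    """Return result_id from a JSON attribute string, or None if unavailable."""
    if attr_value is None:  # extract_first() found no matching attribute
        return None
    try:
        data = json.loads(attr_value)
    except ValueError:  # attribute value was not valid JSON
        return None
    if not isinstance(data, dict):  # valid JSON, but not an object
        return None
    return data.get("result_id")
```

Inside the spider this could be called as result_id_from_attr(css_path.css('div::attr(data-gt)').extract_first()), so a page without the attribute yields None instead of an exception.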

Upvotes: 1

Noah.Kim

Reputation: 119

I think that if you want all the data-xt values, then:

def parse_info_has_id(self, css_path):
    # Leading "." keeps the search inside css_path; a bare "//" would scan the whole page
    text = css_path.xpath('.//div[@data-xt != ""]').extract()
    text = [t.strip() for t in text]
    text = [t for t in text if re.search('result_id', t)]
    return "\n".join(text)

Upvotes: 0
