Scrapy, scraping text in HTML tag when there are no quotation marks?

Question

UPDATE: The number 48 is shown in "Inspect" in Chrome, but not in "View Page Source". I now understand that it is generated by JavaScript, which is why I cannot extract it.

This is the part of the HTML that I am trying to scrape:


     48 
     "times"

The problem is that I cannot get the number 48.
I think this is because there are no quotation marks around 48: I can get the "times" text with no problems, and the only difference I can see is that 48 has no "" around it.

This is the code that works for "times":

response.xpath('.//span[@class="value"]/text()').extract_first()
>>> u'times'

For 48:

response.xpath('.//span[@class="base-entity-display-count"]').extract_first()
>>> u''

As you can see, 48 is missing.

Does anybody have a solution or an idea?

Granitosaurus · Accepted Answer

If you look at the body of the page and search for your number, you can see that there is some embedded JSON.

To solve this you can:

Find the embedded JSON with a regex:

import re
# select everything between "app.boot.push(" and ");"
# (response.text replaces the deprecated response.body_as_unicode())
data = re.findall(r'app\.boot\.push\((\{.+?\})\);', response.text)

Load the JSON and parse it with Python to find the values you want:

import json
data = [json.loads(d) for d in data]
for d in data:
    if d.get('name') == 'BaseEntityDetails':
        print(d['values']['displayCountText'])
#prints: 66
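
For anyone who wants to try the two steps above without running a spider, here is a minimal standalone sketch. The SAMPLE_HTML string is a made-up stand-in for the page source (the real page's markup will differ), and extract_display_count is a hypothetical helper name, not part of Scrapy. It assumes each app.boot.push(...) call holds one JSON object and that "});" does not occur inside the JSON itself.

```python
import json
import re

# Hypothetical page source; the real page's structure may differ.
SAMPLE_HTML = """
<html><body>
<span class="base-entity-display-count"></span>
<script>
app.boot.push({"name": "Header", "values": {"title": "Example"}});
app.boot.push({"name": "BaseEntityDetails", "values": {"displayCountText": "48"}});
</script>
</body></html>
"""

# Capture the JSON object passed to each app.boot.push(...) call.
PUSH_RE = re.compile(r'app\.boot\.push\((\{.+?\})\);', re.DOTALL)


def extract_display_count(html):
    """Return displayCountText from the BaseEntityDetails blob, or None."""
    for raw in PUSH_RE.findall(html):
        try:
            blob = json.loads(raw)
        except ValueError:
            # Skip anything the regex captured that is not valid JSON.
            continue
        if blob.get('name') == 'BaseEntityDetails':
            return blob.get('values', {}).get('displayCountText')
    return None


print(extract_display_count(SAMPLE_HTML))
```

Inside a spider callback you would pass response.text instead of SAMPLE_HTML. Returning None when nothing matches makes it easy to notice if the site changes how it embeds the data.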
