Reputation: 7302
UPDATE: the number 48 is shown in Chrome's "Inspect" view, but not in "View Page Source". I now understand that it is generated by JavaScript, which is why I cannot extract it.
This is the part of the HTML that I am trying to scrape:
<span class="value">
<span class="base-entity-display-count">48</span>
"times"
</span>
The problem is that I cannot get the number 48.
I think the problem is that there are no quotes ("") around 48.
I can get the "times" text with no problems, and the only difference I can see is that 48 has no quotes around it.
This is the code that works for "times":
response.xpath('.//span[@class="value"]/text()').extract_first()
>>> u'<span class="value"><span class="base-entity-display-count"></span>times</span>'
For 48:
response.xpath('.//span[@class="base-entity-display-count"]').extract_first()
>>> u'<span class="base-entity-display-count"></span>'
As you can see, 48 is missing.
Does anybody have a solution or an idea?
Upvotes: 0
Views: 579
Reputation: 21446
If you look at the body of the page and search for your number, you can see that there is some embedded JSON.
To solve this you can:
Find the embedded JSON with a regex:
import re
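# select everything between "app.boot.push(" and ");"
# raw string so the backslashes reach the regex engine unchanged;
# response.text replaces body_as_unicode(), which only old Scrapy versions provide
data = re.findall(r'app\.boot\.push\((\{.+?\})\);', response.text)
Load the JSON and parse it with Python to find the values you want: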
import json
data = [json.loads(d) for d in data]
for d in data:
    if d.get('name') == 'BaseEntityDetails':
        print(d['values']['displayCountText'])
# prints: 66
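The two steps above can be combined into one helper and tried without a live request. This is only a sketch: `SAMPLE_BODY` is a made-up stand-in for the page body (the real markup may differ), and `extract_display_count` is a hypothetical helper name, not part of Scrapy.

```python
import json
import re

# Hypothetical stand-in for response.text; the real page's markup may differ.
SAMPLE_BODY = """
<html><body>
<span class="value"><span class="base-entity-display-count"></span>times</span>
<script>
app.boot.push({"name": "Header", "values": {"title": "Example"}});
app.boot.push({"name": "BaseEntityDetails", "values": {"displayCountText": "48"}});
</script>
</body></html>
"""

# Capture each JSON object passed to app.boot.push(...).
BOOT_PUSH_RE = re.compile(r'app\.boot\.push\((\{.+?\})\);', re.DOTALL)


def extract_display_count(body):
    """Return displayCountText from the BaseEntityDetails blob, or None."""
    for raw in BOOT_PUSH_RE.findall(body):
        try:
            blob = json.loads(raw)
        except ValueError:
            # The non-greedy match can stop early if "});" occurs inside a string.
            continue
        if blob.get('name') == 'BaseEntityDetails':
            return blob.get('values', {}).get('displayCountText')
    return None


print(extract_display_count(SAMPLE_BODY))
```

In a spider callback you would pass the response body text to the helper instead of `SAMPLE_BODY`. Returning `None` when no blob matches keeps the spider from crashing if the page layout changes.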
Upvotes: 3