Manuel

Reputation: 802

Scraping data from an HTTP & JavaScript site

I want to scrape some data from an Amazon page and I'm kind of stuck.

For example, let's take this page.

https://www.amazon.com/NIKE-Hyperfre3sh-Athletic-Sneakers-Shoes/dp/B01KWIUHAM/ref=sr_1_1_sspa?ie=UTF8&qid=1546731934&sr=8-1-spons&keywords=nike+shoes&psc=1

I want to scrape every variant of shoe size and color. That data can be found by opening the page source and searching for 'variationValues'.


There we can see a sort of dictionary containing all the sizes and colors and, below that, in 'asinToDimensionIndexMap', every product code with numbers indicating the variant from the variationValues 'dictionary'.

For example, in asinToDimensionIndexMap we can see

"B01KWIUH5M":[0,0]

This means that the product code B01KWIUH5M is associated with the size '8M US' (position 0 in the size_name section of variationValues) and the color 'Teal' (position 0 in the color section, by the same logic).

I want to scrape both variationValues and asinToDimensionIndexMap, so I can match the index numbers in the map to the entries in variationValues.
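
To make the goal concrete, here is a minimal sketch of the lookup described above, using hand-written sample data instead of the real page. The key name color_name, the second ASIN, the extra sizes and colors, and the helper resolve_variants are assumptions for illustration. Only "B01KWIUH5M":[0,0], '8M US' and 'Teal' come from the page.

```python
# Sample data shaped like variationValues and asinToDimensionIndexMap.
# Only the first size, the first color and the first ASIN come from the page;
# everything else here is made up for illustration.
variation_values = {
    "size_name": ["8M US", "8.5M US", "9M US"],
    "color_name": ["Teal", "Black"],  # key name is an assumption
}
asin_to_index = {
    "B01KWIUH5M": [0, 0],
    "B000000001": [2, 1],  # hypothetical ASIN
}


def resolve_variants(variation_values, asin_to_index, dimensions):
    """Map each ASIN to {dimension name: value} using its index list."""
    resolved = {}
    for asin, indexes in asin_to_index.items():
        resolved[asin] = {
            dim: variation_values[dim][i] for dim, i in zip(dimensions, indexes)
        }
    return resolved


# The order of the dimensions (size first, then color) is assumed to match
# the order of the numbers in each index list.
variants = resolve_variants(
    variation_values, asin_to_index, ["size_name", "color_name"]
)
print(variants["B01KWIUH5M"])
```

So what I am missing is how to get from the page source to two Python objects shaped like variation_values and asin_to_index.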

Another person in the site (thanks for the help btw) suggested doing it this way.

import json
import re

# take the text of the first <script> element as one string
script = response.xpath('//script/text()').extract_first()
# capture everything between {}
data = re.findall(r'(\{.+?\})', script)

# parse the first match as JSON
d = json.loads(data[0])
d['products'][0]

I can sort of understand the first part. We get everything that's a 'script' as a string and then get everything between {}. The issue is what happens after that. My knowledge of JSON is not that great, and reading about it didn't help much.

Is there a way to get, from that data, two dictionaries or lists with the variationValues and asinToDimensionIndexMap (maybe using some regular expressions along the way to pull the data out of a big string)? Or could someone explain a little of what happens in the JSON part?

Thanks for the help!

EDIT: Added photo of variationValues and asinToDimensionIndexMap

Upvotes: 0

Views: 138

Answers (2)

ThunderMind

Reputation: 799

import re

# script is assumed to be the list of <script> texts, e.g. from .extract()
text = ' '.join(script)
variationValues = re.findall(r'variationValues\" : ({.*?})', text)[0]
asinVariationValues = re.findall(r'asinVariationValues\" : ({.*?}})', text)[0]
dimensionValuesData = re.findall(r'dimensionValuesData\" : (\[.*\])', text)[0]
asinToDimensionIndexMap = re.findall(r'asinToDimensionIndexMap\" : ({.*})', text)[0]
dimensionValuesDisplayData = re.findall(r'dimensionValuesDisplayData\" : ({.*})', text)[0]

Now you can easily parse each of them as JSON and combine them as you wish.
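
Regular expressions can capture too little or too much when the objects contain nested braces, so an alternative worth considering is to let the JSON decoder find the end of each value itself. The following is a minimal sketch, not a tested scraper: the helper extract_json_after and the sample string are hypothetical, and it assumes the key appears in the script as "key" followed by a colon and a valid JSON value. It uses json.JSONDecoder.raw_decode, which parses one JSON value starting at a given position and ignores whatever follows it.

```python
import json


def extract_json_after(text, key):
    """Return the parsed JSON value that follows '"key" :' in text, or None."""
    marker = '"%s"' % key
    start = text.find(marker)  # first occurrence only
    if start == -1:
        return None
    colon = text.find(':', start + len(marker))
    if colon == -1:
        return None
    # raw_decode does not skip leading whitespace, so do it by hand
    pos = colon + 1
    while pos < len(text) and text[pos].isspace():
        pos += 1
    value, _end = json.JSONDecoder().raw_decode(text, pos)
    return value


# Made-up stand-in for the text of the page's <script> element
sample = '''
var dataToReturn = {
  "variationValues" : {"size_name":["8M US","9M US"],"color_name":["Teal","Black"]},
  "asinToDimensionIndexMap" : {"B01KWIUH5M":[0,0],"B000000001":[1,1]},
};
'''

variation_values = extract_json_after(sample, "variationValues")
asin_to_index = extract_json_after(sample, "asinToDimensionIndexMap")
print(variation_values["size_name"][asin_to_index["B01KWIUH5M"][0]])
```

If the text after the colon is not valid JSON, raw_decode raises json.JSONDecodeError, so on a real page it may be worth wrapping the call in a try/except.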

Upvotes: 1

Daniel Scott

Reputation: 985

I think you are close, Manuel!

The following code will turn your scraped source into easy-to-select boxes:

import json
d = json.loads(data[0])

JSON is a universal text format for storing object information. In other words, json.loads takes string data and turns it into object data (in Python: dicts, lists, strings, numbers and booleans), regardless of the platform the data came from.
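
As a small illustration of that string-to-object step, here is a sketch with a made-up JSON string (the keys and values are placeholders, not taken from the Amazon page):

```python
import json

# A JSON document is just text until it is parsed
raw = '{"size_name": ["8M US", "9M US"], "count": 2, "in_stock": true}'

d = json.loads(raw)

print(type(d).__name__)          # the outer {} becomes a Python dict
print(d["size_name"][0])         # a JSON array becomes a list
print(d["in_stock"] is True)     # JSON true becomes Python True
```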

https://www.w3schools.com/js/js_json_intro.asp

I'm assuming that where you may be running into trouble is errors when accessing a particular "box" inside your JSON object.

Your code format looks correct, but your access within "each box" may look different.

E.g., if your 'asinToDimensionIndexMap' object is nested within a smaller box in the larger 'products' object, then you might access it like this (after running the code above):

d['products'][0]['asinToDimensionIndexMap']

I've hacked and slashed a little bit so you can better understand the structure of your particular JSON file. Take a look at the link below. On the right-hand side, you will see "which boxes are within one another" - which is precisely what you need to know for accessing what you need.

JSON Object Viewer

For example, the following would yield "companyCompliancePolicies_feature_div":

import json
d = json.loads(data[0])
d['updateDivLists']['full'][0]['divToUpdate']

The person helping you before outlined a general case for you, but you'll need to go in and look at the structure this way to truly find what you're looking for.
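
If you would rather inspect the structure from Python than in a viewer, a small recursive walk can print the access path of every value. This is only a sketch: iter_paths is a hypothetical helper, and the sample string is a cut-down stand-in built from the 'updateDivLists' example above, not the full page data.

```python
import json


def iter_paths(node, prefix=""):
    """Yield the access path of every non-container value in parsed JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_paths(value, "%s[%r]" % (prefix, key))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_paths(value, "%s[%d]" % (prefix, index))
    else:
        # strings, numbers, booleans and None end a path
        yield prefix


# Cut-down stand-in for data[0]
raw = '{"updateDivLists": {"full": [{"divToUpdate": "companyCompliancePolicies_feature_div"}]}}'
d = json.loads(raw)

paths = list(iter_paths(d, "d"))
for path in paths:
    print(path)
```

Each printed path can be pasted back into your code as-is to reach that value.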

Upvotes: 1
