Reputation: 127
I'm using requests and BeautifulSoup4 to download and scrape information from a webpage, and I have successfully narrowed it down to everything inside the particular <script> tag that I'm trying to get data out of. To get this part of the code working, I'm skipping all the requests and BS4 code and just adding this string at the beginning of my code:
Content = '''// <![CDATA[
devicetype = "computer";
isios = false;
videocdn = "media";
videopath = "updates/na/vid01";
poster = {
"file": "preview/vidsplash.jpg",
"st": "1557499029",
"et": "1557502629",
"hs": "f3ad16f42fec5224d323915cdfbf43ed"
};
attachname = "some-video-00001234";
videos[0] = {
"wmv": {
"file": "wmv/01.wmv",
"name": "01",
"duration": 502,
"size": "195.1MB",
"wid": 854,
"hgt": 480,
"st": "1557499029",
"et": "1557502629",
"hs": "a0cfdef3b8b9e3dea576368a5bfbaef9",
"caps": []
},
"h264": {
"file": "h264/01.mp4",
"name": "01",
"duration": 502,
"size": "73.9MB",
"wid": 854,
"hgt": 480,
"st": "1557499029",
"et": "1557502629",
"hs": "32901a1870d0b32458b465ac9c3d6cad",
"caps": [{
"file": "001.jpg",
"fs": {
"st": "1557499029",
"et": "1557502629",
"hs": "5b328642a84fa6406bda527c18e46c27"
},
"tn": {
"st": "1557499029",
"et": "1557502629",
"hs": "0a4ad7d0edf1b92538b8127f8e297c41"
}
}, {
"file": "002.jpg",
"fs": {
"st": "1557499029",
"et": "1557502629",
"hs": "4390c0d9b321b5e86c88cb8ca5e56ede"
},
"tn": {
"st": "1557499029",
"et": "1557502629",
"hs": "9cf83158268379df660d6d01750a047c"
}
}]
}
};
// ]]>'''
Also note that this is prettified. Normally the "poster" and "videos[0]" variables would each be on a single line, not spread over multiple indented lines as they are here. This also isn't the complete set of data from the <script> tag; I stripped out the repeated parts just so y'all could get an idea of the structure of the data. Also note that "videos[0]" repeats with a similar data structure as "videos[1]" and so on, a variable number of times.
What I'm trying to do is convert that big multi-line string into a proper dictionary that I can manipulate in my Python code to extract the bits I need, for example:
print(NewContent)
Output:
{'devicetype': 'computer', 'isios': False, 'videocdn': 'media'}
And so on.
I've been messing around with js2py trying to get it to do what I need it to do, but so far the farthest I've gotten was with this code:
import js2py

splitrawlines = Content.splitlines()
rawvars = []
for line in splitrawlines:
    # Declare videos first in case the line assigns to videos[0] before it exists.
    rawvars.append(js2py.eval_js("videos = [];\n" + line))
print(rawvars)
The only problem is that it doesn't output a dict; it outputs a list. I could probably still make that work, but it isn't even a list that Python can manipulate: it's technically still a js2py.base.JsObjectWrapper object. I can convert that object to a string, but the only ways I can find of converting a string to a list involve splitting the string on spaces and putting each piece into its own entry in the list. I basically have an already formatted list, just inside a string.
I may be going in the wrong direction with that code, but it's the closest I've gotten so far. So I need to either find a way to convert a string that's already formatted as a full-fledged list into an actual list object or, preferably, find some different way of getting all the variables in arbitrary JavaScript code into native Python variables that I can manipulate.
Upvotes: 0
Views: 692
Reputation: 142859
JavaScript data like this is mostly in JSON format, so you can use Python's json module to convert it to a Python dictionary.
For example, the data after "videos[0] = " is valid JSON, so you can use data = json.loads(string) to create a dictionary and then get, e.g., data['wmv']['size']:
data = '''{
"wmv": {
"file": "wmv/01.wmv",
"name": "01",
"duration": 502,
"size": "195.1MB",
"wid": 854,
"hgt": 480,
"st": "1557499029",
"et": "1557502629",
"hs": "a0cfdef3b8b9e3dea576368a5bfbaef9",
"caps": []
},
"h264": {
"file": "h264/01.mp4",
"name": "01",
"duration": 502,
"size": "73.9MB",
"wid": 854,
"hgt": 480,
"st": "1557499029",
"et": "1557502629",
"hs": "32901a1870d0b32458b465ac9c3d6cad",
"caps": [{
"file": "001.jpg",
"fs": {
"st": "1557499029",
"et": "1557502629",
"hs": "5b328642a84fa6406bda527c18e46c27"
},
"tn": {
"st": "1557499029",
"et": "1557502629",
"hs": "0a4ad7d0edf1b92538b8127f8e297c41"
}
}, {
"file": "002.jpg",
"fs": {
"st": "1557499029",
"et": "1557502629",
"hs": "4390c0d9b321b5e86c88cb8ca5e56ede"
},
"tn": {
"st": "1557499029",
"et": "1557502629",
"hs": "9cf83158268379df660d6d01750a047c"
}
}]
}
}'''
import json
data = json.loads(data)
print(data['wmv']['size'])
# 195.1MB
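The example above pastes the object by hand. As a minimal sketch, the same value could be pulled out of a raw script line instead, assuming each assignment sits on one line and ends with `;` as the question describes. The `line` string here is a shortened, illustrative version of the question's data, and the variable names are my own.

```python
import json

# Hypothetical one-line assignment, shortened from the question's data.
line = 'videos[0] = {"wmv": {"file": "wmv/01.wmv", "size": "195.1MB", "caps": []}};'

# Split once at the first " = " so any "=" inside the value is left alone.
name, _, value = line.partition(' = ')

# Drop the trailing ";" before handing the text to the JSON parser.
video = json.loads(value.rstrip().rstrip(';'))

print(name)                  # videos[0]
print(video['wmv']['size'])  # 195.1MB
```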
If every variable is on one line, you can use split('\n') to get the lines and then split('=') to get the key and value.
Then you only have to check whether the value starts with { or [ to decide whether to use json. Other values can be plain strings, so they don't need json; they may only need the surrounding " removed.
Content = '''// <![CDATA[
devicetype = "computer";
isios = false;
videocdn = "media";
videopath = "updates/na/vid01";
poster = {"file": "preview/vidsplash.jpg","st": "1557499029","et": "1557502629","hs": "f3ad16f42fec5224d323915cdfbf43ed"};
attachname = "some-video-00001234";'''
import json
results = {}
for line in Content.split('\n'):
    if ' = ' in line:
        line = line[:-1]  # remove `;`
        key, val = line.split(' = ', 1)
        if val.startswith(('[', '{')):
            results[key] = json.loads(val)
        elif val.startswith('"'):
            val = val[1:-1]  # remove `"`
            results[key] = val
        elif val == 'false':
            results[key] = False
        elif val == 'true':
            results[key] = True
print(results['devicetype'])
print(results['isios'])
print(results['videocdn'])
print(results['poster']['file'])
# computer
# False
# media
# preview/vidsplash.jpg
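The question also mentions "videos[0]", "videos[1]" and so on. The following is a rough sketch of one way to extend the same idea, not a general JavaScript parser. It assumes one assignment per line, and it leans on the fact that quoted strings, numbers, true/false, objects and arrays are all valid JSON values, so json.loads can be tried on every value. The helper name `parse_script_vars`, the regular expression, and the "videos[1]" sample line are my own illustrative choices.

```python
import json
import re

# Matches lines such as `name = value;` or `name[3] = value;`.
ASSIGNMENT = re.compile(r'^\s*(\w+)(?:\[(\d+)\])?\s*=\s*(.*?);?\s*$')

def parse_script_vars(text):
    """Collect one-line JavaScript assignments into a dict (hypothetical helper)."""
    results = {}
    for line in text.splitlines():
        match = ASSIGNMENT.match(line)
        if not match:
            continue  # skips comment lines such as `// <![CDATA[`
        name, index, raw = match.groups()
        try:
            value = json.loads(raw)  # strings, numbers, true/false, objects, arrays
        except ValueError:
            value = raw              # not valid JSON: keep the raw text
        if index is None:
            results[name] = value
        else:
            # `videos[0]`, `videos[1]`, ... are grouped under "videos" by index
            results.setdefault(name, {})[int(index)] = value
    return results

# Shortened sample; the videos[1] line is made up for illustration.
sample = '''// <![CDATA[
devicetype = "computer";
isios = false;
poster = {"file": "preview/vidsplash.jpg", "st": "1557499029"};
videos[0] = {"wmv": {"file": "wmv/01.wmv", "duration": 502}};
videos[1] = {"wmv": {"file": "wmv/02.wmv", "duration": 311}};
// ]]>'''

parsed = parse_script_vars(sample)
print(parsed['isios'])                         # False
print(parsed['videos'][1]['wmv']['duration'])  # 311
```

Anything that is not valid JSON (for example a JavaScript expression or a single-quoted string) is kept as raw text here, so it would still need separate handling.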
Upvotes: 1