Reputation: 1802
I've got a script to get the page source of a webpage, and there's a value I'm trying to extract from it, but the returned string is a mixture of HTML, JSON and JavaScript. I'd show you the page source, but some parts of it contain sensitive data that I haven't gotten around to removing yet. If you need an example of what I get back, I can make something up. Other than that, this is the small Python script so far:
import requests as r

def app(url):
    request = r.get(url)
    content = request.content
    print(content)
I tried to find the string with a simple string.find() call, but I have no idea how to throw away all the useless parts of the result. It's not like I can just parse the JSON part and store it somewhere (which would then give me easy access to the value), is it?
Thanks.
EDIT:
Here's an example input and output (not what my script actually targets, but I remembered that the page source of Instagram posts is similar).
Input:
view-source:https://www.instagram.com/p/B-U4-cVAp5y/
Output: the link to the file is here; I can't add it to the question as it's so large.
There is a JSON part at the bottom of the code. Somewhere inside that JSON is a value called 'video_url', and that's the value I'm trying to get (though obviously not from Instagram). I have stripped the JSON from the full result and prettified it so you can see it easily, which you can find here. The value I'm trying to retrieve looks like this:
"video_url":"https://scontent-lhr8-1.cdninstagram.com/v/t50.2886-16/90894630_221502022556337_2214905061309385826_n.mp4?_nc_ht=scontent-lhr8-1.cdninstagram.com\u0026_nc_cat=109\u0026_nc_ohc=waOdsa3MtFcAX83adIS\u0026oe=5E8413A8\u0026oh=d6ba6cb583afd7f341f6844c0fd02dbf"
I can't get to that JSON, however, as there is so much going on, and I can't find a decent module to search it with.
Upvotes: 0
Views: 1790
Reputation: 151
I want to share a couple of other approaches that use Beautiful Soup. There could be some advantages over simply using a regular expression, since this parses the page data similarly to how a real web browser would.
# Sample content based on the format of <https://pastebin.com/raw/YGPupvjj>
content = '''
<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Fake Page</title>
    <script type="text/javascript">
      (function() { var xyz = 'Some other irrelevant script block'; })();
    </script>
  </head>
  <body>
    <p>Dummy body content</p>
    <script type="text/javascript">
      window._sharedData = {
        "entry_data": {
          "PostPage": [{
            "graphql": {
              "shortcode_media": {
                "edge_media_to_tagged_user": {
                  "edges": [{
                    "node": {
                      "user": {
                        "full_name": "John Doe",
                        "id": "132389782",
                        "is_verified": false,
                        "profile_pic_url": "https://example.com/something.jpg",
                        "username": "johndoe"
                      }
                    }
                  }]
                }
              }
            }
          }]
        }
      };
    </script>
  </body>
</html>
'''
If instead you want to try this with actual page data, you can fetch it:
import requests
request = requests.get('https://pastebin.com/raw/YGPupvjj')
content = request.content
Use Beautiful Soup to parse the web content:
from bs4 import BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')
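As a quick sanity check (a minimal standalone sketch with made-up markup, not your actual page), you can confirm the parser exposes the script blocks and their text:

```python
from bs4 import BeautifulSoup

# Minimal made-up markup with two script blocks
html = '<script>var a = 1;</script><script>window._sharedData = {};</script>'
soup = BeautifulSoup(html, 'html.parser')
scripts = soup.find_all('script')
print(len(scripts))       # 2
print(scripts[1].string)  # window._sharedData = {};
```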
Beautiful Soup gives us easy access to the <script> blocks that contain your data, but it only returns their contents as strings; it can't parse the JavaScript itself. Here are two ways to extract the data. The first isolates the window._sharedData object with a regular expression, parses it as JSON, and searches it recursively:
import json
import re

# Search the JSON data recursively and yield any dict item value with
# the key "profile_pic_url"
def search(d):
    if isinstance(d, list):
        for x in d:
            yield from search(x)
        return
    if not isinstance(d, dict):
        return
    url = d.get('profile_pic_url')
    if url:
        yield url
    for v in d.values():
        yield from search(v)

for script_block in soup.find_all('script'):
    if not script_block.string:
        continue
    m = re.fullmatch(r'(?s)\s*window\._sharedData\s*=\s*({.*});\s*', script_block.string)
    if m is not None:
        data = json.loads(m.group(1))
        for x in search(data):
            print(x)
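The recursive search above is hard-coded to 'profile_pic_url'. A generalized version (a hypothetical search_key helper that takes the key name as a parameter) shows the same idea in isolation and would let you look for the asker's 'video_url' instead:

```python
import json

# Hypothetical generalized helper: recursively yield every value stored
# under `key` anywhere in a nested structure of dicts and lists.
def search_key(d, key):
    if isinstance(d, list):
        for x in d:
            yield from search_key(x, key)
        return
    if not isinstance(d, dict):
        return
    if key in d:
        yield d[key]
    for v in d.values():
        yield from search_key(v, key)

data = json.loads('{"a": [{"video_url": "u1"}, {"b": {"video_url": "u2"}}]}')
print(list(search_key(data, 'video_url')))  # ['u1', 'u2']
```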
Alternatively, you can use pyjsparser to parse the JavaScript in the <script> blocks directly and search for the literal key in the parsed syntax tree:

import pyjsparser
# Search the syntax tree recursively and yield the value of any
# JS Object property with the literal key "profile_pic_url"
def search(d):
    if isinstance(d, list):
        for x in d:
            yield from search(x)
        return
    if not isinstance(d, dict):
        return
    if d['type'] == 'ObjectExpression':
        for p in d['properties']:
            if (p['key']['type'] == 'Literal'
                    and p['value']['type'] == 'Literal'
                    and p['key']['value'] == 'profile_pic_url'):
                yield p['value']['value']
            yield from search(p['key'])
            yield from search(p['value'])
        return
    for v in d.values():
        yield from search(v)

for script_block in soup.find_all('script'):
    if not script_block.string:
        continue
    try:
        code = pyjsparser.parse(script_block.string)
    except pyjsparser.JsSyntaxError:
        continue
    for found in search(code):
        print(found)
Upvotes: 0
Reputation: 7353
You can use regular expressions (regex) to do this. You need to import re and then use the following to get a list of all the video_urls:

import re
re.findall(r'"video_url":\s*"(.[^\s]*)"\s', str(content))
# suppose this is the text in your "content"
content = '''
"video_url":"https://scontent-lhr8-1.cdninstagram.com/v/t50.2886-16/90894630_221502022556337_2214905061309385826_n.mp4?_nc_ht=scontent-lhr8-1.cdninstagram.com\u0026_nc_cat=109\u0026_nc_ohc=waOdsa3MtFcAX83adIS\u0026oe=5E8413A8\u0026oh=d6ba6cb583afd7f341f6844c0fd02dbf"
jhasbvvlb
duyd7f97tyqubgjn ] \
f;vjnus0fjgr9eguer
Vn d[sb]-u54ldb
"video_url": ---
"video_url": "https://www.google.com"
'''
Then the following will give you a list of video_urls.
import re
re.findall(r'"video_url":\s*"(.[^\s]*)"\s', content)
Output:
['https://scontent-lhr8-1.cdninstagram.com/v/t50.2886-16/90894630_221502022556337_2214905061309385826_n.mp4?_nc_ht=scontent-lhr8-1.cdninstagram.com&_nc_cat=109&_nc_ohc=waOdsa3MtFcAX83adIS&oe=5E8413A8&oh=d6ba6cb583afd7f341f6844c0fd02dbf',
'https://www.google.com']
I would also encourage you to learn more about the application of regular expressions in Python.
See this: https://developers.google.com/edu/python/regular-expressions
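One caveat: the example content above is a regular Python string literal, so escapes like \u0026 were already decoded to & before the regex ran. Against raw page source, the matches will still contain those JSON escapes. A minimal sketch of decoding them (made-up URL, assuming the value is a well-formed JSON string):

```python
import json
import re

# Raw page text: here \u0026 is six literal characters, not yet "&"
page = r'"video_url":"https://example.com/video.mp4?a=1\u0026b=2"'

raw = re.findall(r'"video_url":\s*"([^"\s]*)"', page)[0]
url = json.loads(f'"{raw}"')  # decode JSON escapes like \u0026 -> &
print(url)  # https://example.com/video.mp4?a=1&b=2
```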
Upvotes: 1