Nat

Reputation: 1802

Get a string from page source

I've got a script that fetches the page source of a webpage, and there's a value I'm trying to get out of it, but the returned string is a mixture of HTML, JSON and JavaScript. I'd show you the page source, but some parts of it contain sensitive data that I haven't gotten around to fixing yet. If you need an example of what I get back, I can make something up. In the meantime, this is the small Python script so far:

import requests as r


def app(url):
    request = r.get(url)
    content = request.content
    print(content)

I tried to find the string with a simple string.find() call, but I have no clue how to throw away all the useless bits of the result. It's not like I can just parse the JSON part and store it somewhere (which would give me easy access to the value), is it?

Thanks.

EDIT:

Here's an example input and output (not what my script actually goes after, but I remembered that the page source of Instagram posts is similar).

Input:

view-source:https://www.instagram.com/p/B-U4-cVAp5y/

Output: a link to the file is here; I can't add it to the question as it's so large.

There is a JSON part at the bottom of the code. Somewhere inside that JSON is a value called 'video_url', and that's the value I'm trying to get (though obviously not from Instagram itself). I have stripped the JSON from the full result and prettified it so you can see it easily; you can find it here. The value I'm trying to retrieve looks like this:

"video_url":"https://scontent-lhr8-1.cdninstagram.com/v/t50.2886-16/90894630_221502022556337_2214905061309385826_n.mp4?_nc_ht=scontent-lhr8-1.cdninstagram.com\u0026_nc_cat=109\u0026_nc_ohc=waOdsa3MtFcAX83adIS\u0026oe=5E8413A8\u0026oh=d6ba6cb583afd7f341f6844c0fd02dbf"

I can't get to that JSON, however; with so much other stuff going on in the page source, I haven't found a decent module to search it with.
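
To show what I mean, this is roughly what I'd hope to do if I could somehow isolate that JSON into a string (the raw_json value and the nesting below are made up just for illustration):

import json

# raw_json stands in for the JSON once it has been cut out of the page source;
# the nesting here is invented just to illustrate the idea
raw_json = '{"graphql": {"shortcode_media": {"video_url": "https://example.com/video.mp4"}}}'

data = json.loads(raw_json)
print(data["graphql"]["shortcode_media"]["video_url"])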

Upvotes: 0

Views: 1790

Answers (2)

Ray B

Reputation: 151

I want to share a couple of other approaches that use Beautiful Soup. There could be some advantages over simply using a regular expression, since this parses the page data similarly to how a real web browser would.

# Sample content based on the format of <https://pastebin.com/raw/YGPupvjj>
content = '''
<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Fake Page</title>
    <script type="text/javascript">
    (function() { var xyz = 'Some other irrelevant script block'; })();
    </script>
  </head>
  <body>
    <p>Dummy body content</p>
    <script type="text/javascript">
        window._sharedData = {
            "entry_data": {
                "PostPage": [{
                    "graphql": {
                        "shortcode_media": {
                            "edge_media_to_tagged_user": {
                                "edges": [{
                                    "node": {
                                        "user": {
                                            "full_name": "John Doe",
                                            "id": "132389782",
                                            "is_verified": false,
                                            "profile_pic_url": "https://example.com/something.jpg",
                                            "username": "johndoe"
                                        }
                                    }
                                }]
                            }
                        }
                    }
                }]
            }
        };
    </script>
  </body>
</html>
'''

If instead you want to try this with actual page data, you can fetch it:

import requests
request = requests.get('https://pastebin.com/raw/YGPupvjj')
content = request.content

Use Beautiful Soup to parse the web content:

from bs4 import BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')

Beautiful Soup gives us easy access to the <script> blocks that contain your data, but it only returns their contents as strings; it can't parse the JavaScript itself. Here are two ways to extract the data.
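
To make that concrete, here is a quick check of what comes back (assuming the soup object built above):

for script_block in soup.find_all('script'):
    # .string is None for empty blocks, otherwise the raw text of the block
    print(type(script_block.string), str(script_block.string)[:60])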

Approach #1: Find the JSON data using a regular expression, use Python's json library to parse it, and search the loaded JSON data.

import json
import re

# Search JSON data recursively and yield any dict item value with
# key "profile_pic_url"
def search(d):

    if isinstance(d, list):
        for x in d:
            yield from search(x)
        return

    if not isinstance(d, dict):
        return

    url = d.get('profile_pic_url')
    if url:
        yield url

    for v in d.values():
        yield from search(v)


for script_block in soup.find_all('script'):

    if not script_block.string:
        continue

    m = re.fullmatch(r'(?s)\s*window\._sharedData\s*=\s*({.*\});\s*', script_block.string)

    if m is not None:
        data = json.loads(m.group(1))
        for x in search(data):
            print(x)

Approach #2: Use pyjsparser to parse the JavaScript <script> blocks, and search for the literal key in the parsed syntax tree.
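
For context, pyjsparser turns a JavaScript string into a nested dict of ESTree-style nodes (types like 'Program', 'ObjectExpression', 'Literal'), which is what the search below walks. A rough illustration, separate from the code that follows:

import pyjsparser

# A tiny assignment parsed into its syntax tree (nested dicts of ESTree-style nodes)
tree = pyjsparser.parse('window._sharedData = {"profile_pic_url": "https://example.com/x.jpg"};')
print(tree['type'])             # expected: 'Program'
print(tree['body'][0]['type'])  # expected: 'ExpressionStatement'

The approach itself: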

import pyjsparser

# Search the syntax tree recursively and yield the value of any
# JS object property with the literal key "profile_pic_url"
def search(d):

    if isinstance(d, list):
        for x in d:
            yield from search(x)
        return

    if not isinstance(d, dict):
        return

    if d['type'] == 'ObjectExpression':
        for p in d['properties']:
            if (p['key']['type'] == 'Literal'
                    and p['value']['type'] == 'Literal'
                    and p['key']['value'] == 'profile_pic_url'):
                yield p['value']['value']
            yield from search(p['key'])
            yield from search(p['value'])
        return

    for v in d.values():
        yield from search(v)

for script_block in soup.find_all('script'):

    if not script_block.string:
        continue

    try:
        code = pyjsparser.parse(script_block.string)
    except pyjsparser.JsSyntaxError:
        continue

    for found in search(code):
        print(found)

Upvotes: 0

CypherX

Reputation: 7353

Solution

You can use regular expressions (regex) to do this. Import re and then use the following to get a list of all the video_url values:

import re
re.findall(r'"video_url":\s*"(.[^\s]*)"\s', str(content))

Dummy Data

# suppose this is the text in your "content"
content = '''
"video_url":"https://scontent-lhr8-1.cdninstagram.com/v/t50.2886-16/90894630_221502022556337_2214905061309385826_n.mp4?_nc_ht=scontent-lhr8-1.cdninstagram.com\u0026_nc_cat=109\u0026_nc_ohc=waOdsa3MtFcAX83adIS\u0026oe=5E8413A8\u0026oh=d6ba6cb583afd7f341f6844c0fd02dbf"

jhasbvvlb
duyd7f97tyqubgjn ] \
f;vjnus0fjgr9eguer
Vn d[sb]-u54ldb 
"video_url": ---
"video_url": "https://www.google.com"
'''

Code

Then the following will give you a list of video_urls.

import re
re.findall(r'"video_url":\s*"(.[^\s]*)"\s', content)

Output:

['https://scontent-lhr8-1.cdninstagram.com/v/t50.2886-16/90894630_221502022556337_2214905061309385826_n.mp4?_nc_ht=scontent-lhr8-1.cdninstagram.com&_nc_cat=109&_nc_ohc=waOdsa3MtFcAX83adIS&oe=5E8413A8&oh=d6ba6cb583afd7f341f6844c0fd02dbf',
 'https://www.google.com']
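
If the one-liner is hard to read, the same pattern can be spelled out with re.VERBOSE so each piece is commented (equivalent to the pattern above, reusing the content string from the dummy data):

import re

# The same pattern as above, written out piece by piece
pattern = re.compile(r'''
    "video_url":      # the literal key, quotes included
    \s*               # optional whitespace after the colon
    "                 # opening quote of the value
    (.[^\s]*)         # capture the value: one character, then a run of non-whitespace
    "                 # closing quote of the value
    \s                # must be followed by whitespace (e.g. a newline)
''', re.VERBOSE)

print(pattern.findall(content))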

References

I would also encourage you to learn more about the application of regular expressions in Python.

See this: https://developers.google.com/edu/python/regular-expressions

Upvotes: 1
