Nat

Reputation: 1802

Get a string from page source

I've got a script that fetches the page source of a webpage, and there's a value I'm trying to get out of it, but the returned string is a mixture of HTML, JSON and JavaScript. I'd show you the page source, but some parts of it contain sensitive data that I haven't gotten around to fixing yet. If you need an example of what I get back, I can make something up. In the meantime, this is the small Python script so far:

import requests as r


def app(url):
    request = r.get(url)
    content = request.content
    print(content)

I tried to find the string with a simple string.find() call, but I have no clue how to throw away all the useless bits of the result. It's not like I can just parse the JSON part and store it somewhere (which would give me easy access to the value), is it?

Thanks.

EDIT:

Here's an example input and output (not what my script actually goes after, but I remembered that the page source of Instagram posts is similar).

Input:

view-source:https://www.instagram.com/p/B-U4-cVAp5y/

Output: a link to the file is here; I can't add it to the question as it's so large.

There is a JSON part at the bottom of the code. Somewhere inside that JSON is a value called 'video_url', and that's the value I'm trying to get (though obviously not from Instagram itself). I have stripped the JSON from the full result and prettified it so you can see it easily; you can find it here. The value I'm trying to retrieve looks like this:

"video_url":"https://scontent-lhr8-1.cdninstagram.com/v/t50.2886-16/90894630_221502022556337_2214905061309385826_n.mp4?_nc_ht=scontent-lhr8-1.cdninstagram.com\u0026_nc_cat=109\u0026_nc_ohc=waOdsa3MtFcAX83adIS\u0026oe=5E8413A8\u0026oh=d6ba6cb583afd7f341f6844c0fd02dbf"

I can't get to that JSON, however; with so much other stuff going on in the page source, I haven't found a decent module to search it with.
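
To show what I mean, this is roughly what I'd hope to do if I could somehow isolate that JSON into a string (the raw_json value and the nesting below are made up just for illustration):

import json

# raw_json stands in for the JSON once it has been cut out of the page source;
# the nesting here is invented just to illustrate the idea
raw_json = '{"graphql": {"shortcode_media": {"video_url": "https://example.com/video.mp4"}}}'

data = json.loads(raw_json)
print(data["graphql"]["shortcode_media"]["video_url"])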

Upvotes: 0

Views: 1790

Answers (2)

Ray B

Reputation: 151

I want to share a couple of other approaches that use Beautiful Soup. There could be some advantages over simply using a regular expression, since this parses the page data similarly to how a real web browser would.

# Sample content based on the format of <https://pastebin.com/raw/YGPupvjj>
content = '''
<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Fake Page</title>
    <script type="text/javascript">
    (function() { var xyz = 'Some other irrelevant script block'; })();
    </script>
  </head>
  <body>
    <p>Dummy body content</p>
    <script type="text/javascript">
        window._sharedData = {
            "entry_data": {
                "PostPage": [{
                    "graphql": {
                        "shortcode_media": {
                            "edge_media_to_tagged_user": {
                                "edges": [{
                                    "node": {
                                        "user": {
                                            "full_name": "John Doe",
                                            "id": "132389782",
                                            "is_verified": false,
                                            "profile_pic_url": "https://example.com/something.jpg",
                                            "username": "johndoe"
                                        }
                                    }
                                }]
                            }
                        }
                    }
                }]
            }
        };
    </script>
  </body>
</html>
'''

If instead you want to try this with actual page data, you can fetch it:

import requests
request = requests.get('https://pastebin.com/raw/YGPupvjj')
content = request.content

Use Beautiful Soup to parse the web content:

from bs4 import BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')

Beautiful Soup gives us easy access to the <script> blocks that contain your data, but it only returns their contents as strings; it can't parse the JavaScript itself. Here are two ways to extract the data.
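
To make that concrete, here is a quick check of what comes back (assuming the soup object built above):

for script_block in soup.find_all('script'):
    # .string is None for empty blocks, otherwise the raw text of the block
    print(type(script_block.string), str(script_block.string)[:60])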

Approach #1: Find the JSON data using a regular expression, use Python's json library to parse it, and search the loaded JSON data.

import json
import re

# Search JSON data recursively and yield any dict item value with
# key "profile_pic_url"
def search(d):

    if isinstance(d, list):
        for x in d:
            yield from search(x)
        return

    if not isinstance(d, dict):
        return

    url = d.get('profile_pic_url')
    if url:
        yield url

    for v in d.values():
        yield from search(v)


for script_block in soup.find_all('script'):

    if not script_block.string:
        continue

    m = re.fullmatch(r'(?s)\s*window\._sharedData\s*=\s*({.*\});\s*', script_block.string)

    if m is not None:
        data = json.loads(m.group(1))
        for x in search(data):
            print(x)

Approach #2: Use pyjsparser to parse the JavaScript <script> blocks, and search for the literal key in the parsed syntax tree.
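
For context, pyjsparser turns a JavaScript string into a nested dict of ESTree-style nodes (types like 'Program', 'ObjectExpression', 'Literal'), which is what the search below walks. A rough illustration, separate from the code that follows:

import pyjsparser

# A tiny assignment parsed into its syntax tree (nested dicts of ESTree-style nodes)
tree = pyjsparser.parse('window._sharedData = {"profile_pic_url": "https://example.com/x.jpg"};')
print(tree['type'])             # expected: 'Program'
print(tree['body'][0]['type'])  # expected: 'ExpressionStatement'

The approach itself: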

import pyjsparser

# Search the syntax tree recursively and yield the value of any
# JS object property with the literal key "profile_pic_url"
def search(d):

    if isinstance(d, list):
        for x in d:
            yield from search(x)
        return

    if not isinstance(d, dict):
        return

    if d['type'] == 'ObjectExpression':
        for p in d['properties']:
            if (p['key']['type'] == 'Literal'
                    and p['value']['type'] == 'Literal'
                    and p['key']['value'] == 'profile_pic_url'):
                yield p['value']['value']
            yield from search(p['key'])
            yield from search(p['value'])
        return

    for v in d.values():
        yield from search(v)

for script_block in soup.find_all('script'):

    if not script_block.string:
        continue

    try:
        code = pyjsparser.parse(script_block.string)
    except pyjsparser.JsSyntaxError:
        continue

    for found in search(code):
        print(found)

Upvotes: 0

CypherX

Reputation: 7353

Solution

You can use regular expressions (regex) to do this. Import re and then use the following to get a list of all the video_url values:

import re
re.findall(r'"video_url":\s*"(.[^\s]*)"\s', str(content))

Dummy Data

# suppose this is the text in your "content"
content = '''
"video_url":"https://scontent-lhr8-1.cdninstagram.com/v/t50.2886-16/90894630_221502022556337_2214905061309385826_n.mp4?_nc_ht=scontent-lhr8-1.cdninstagram.com\u0026_nc_cat=109\u0026_nc_ohc=waOdsa3MtFcAX83adIS\u0026oe=5E8413A8\u0026oh=d6ba6cb583afd7f341f6844c0fd02dbf"

jhasbvvlb
duyd7f97tyqubgjn ] \
f;vjnus0fjgr9eguer
Vn d[sb]-u54ldb 
"video_url": ---
"video_url": "https://www.google.com"
'''

Code

Then the following will give you a list of video_urls.

import re
re.findall(r'"video_url":\s*"(.[^\s]*)"\s', content)

Output:

['https://scontent-lhr8-1.cdninstagram.com/v/t50.2886-16/90894630_221502022556337_2214905061309385826_n.mp4?_nc_ht=scontent-lhr8-1.cdninstagram.com&_nc_cat=109&_nc_ohc=waOdsa3MtFcAX83adIS&oe=5E8413A8&oh=d6ba6cb583afd7f341f6844c0fd02dbf',
 'https://www.google.com']
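
If the one-liner is hard to read, the same pattern can be spelled out with re.VERBOSE so each piece is commented (equivalent to the pattern above, reusing the content string from the dummy data):

import re

# The same pattern as above, written out piece by piece
pattern = re.compile(r'''
    "video_url":      # the literal key, quotes included
    \s*               # optional whitespace after the colon
    "                 # opening quote of the value
    (.[^\s]*)         # capture the value: one character, then a run of non-whitespace
    "                 # closing quote of the value
    \s                # must be followed by whitespace (e.g. a newline)
''', re.VERBOSE)

print(pattern.findall(content))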

References

I would also encourage you to learn more about the application of regular expressions in Python.

See this: https://developers.google.com/edu/python/regular-expressions

Upvotes: 1
