user971956

Reputation: 3208

How to extract a JSON object defined in an HTML page's JavaScript block using Python?

I am downloading HTML pages that have data defined in them in the following way:

... <script type= "text/javascript">    window.blog.data = {"activity":{"type":"read"}}; </script> ...

I would like to extract the JSON object assigned to 'window.blog.data'. Is there a simpler way than parsing it manually? (I am looking into Beautiful Soup but can't seem to find a method that will return the exact object without parsing.)

Thanks

Edit: Would it be possible and more correct to do this with a python headless browser (e.g., Ghost.py)?

Upvotes: 20

Views: 41995

Answers (4)

Amine Rizk

Reputation: 87

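A fast and easy way is a regex of the form 'exact start text(.*?)exact end text': put the literal text that precedes the value before (.*?) and the literal text that follows it after. That's all!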

import re
import json
html = """<!doctype html>
<title>extract javascript object as json</title>
<script>
// ..
window.blog.data = {"activity":{"type":"read"}};
// ..
</script>
<p>some other html here
"""

Then simply

re.search(r'{"activity":{"type":"(.*?)"', html).group(1)

or, for the full JSON:

jsondata = re.search(r'window\.blog\.data = (.*?);', html).group(1)
jsondata = json.loads(jsondata)
print(jsondata["activity"])

#output {'type': 'read'}
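
One caveat with the non-greedy (.*?); pattern: it stops at the first semicolon, so it would cut the object short if a string value contained ';'. A minimal sketch of one way around that, assuming the literal is valid JSON, is to locate the assignment and let json.JSONDecoder.raw_decode decide where the object ends. The helper name extract_assigned_json and the sample HTML are made up for illustration.

```python
import json
import re


def extract_assigned_json(html, name):
    """Return the JSON value assigned to `name` in `html` (hypothetical helper)."""
    # Locate "<name> =" and skip any whitespace after the equals sign.
    match = re.search(re.escape(name) + r"\s*=\s*", html)
    if match is None:
        raise ValueError("assignment to %s not found" % name)
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here the ';' and the rest of the page).
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


# Sample page (an assumption): the string value contains a semicolon.
sample = '<script>window.blog.data = {"activity":{"type":"read; later"}};</script>'
data = extract_assigned_json(sample, "window.blog.data")
print(data["activity"]["type"])
```

This still fails for object literals that are legal JavaScript but not JSON (unquoted keys, single quotes, trailing commas).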

Upvotes: -1

user1071182

Reputation: 1627

I had a similar issue and ended up using selenium with phantomjs. It's a little hacky and I couldn't quite figure out the correct wait until method, but the implicit wait seems to work fine so far for me.

from selenium import webdriver
import json
import re

url = "http..."
driver = webdriver.PhantomJS(service_args=['--load-images=no'])
driver.set_window_size(1120, 550)
driver.get(url)
driver.implicitly_wait(1)
script_text = re.search(r'window\.blog\.data\s*=.*<\/script>', driver.page_source).group(0)

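# split text on the first equal sign, then remove the trailing script tag and semicolon
# (str.rstrip('</script>') would strip a set of characters, not the suffix)
json_text = script_text.split('=', 1)[1]
json_text = re.sub(r'</script>\s*$', '', json_text).strip().rstrip(';').strip()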
# only care about first piece of json
json_text = json_text.split("};")[0] + "}"
data = json.loads(json_text)

driver.quit()
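
Since the browser has already run the page's scripts at this point, another option is to ask it for the object directly instead of scraping page_source. A hedged sketch: Selenium's execute_script converts returned JavaScript objects to Python dicts and lists, so no regex or json.loads is needed. The helper name read_js_value is hypothetical, and the stub driver below only stands in for a real WebDriver so the snippet runs without a browser.

```python
def read_js_value(driver, expression):
    """Evaluate a JavaScript expression in the page and return its value.

    `driver` is any object that provides Selenium's execute_script() method.
    """
    return driver.execute_script("return %s;" % expression)


class StubDriver:
    """Stand-in for a real WebDriver, used only for this illustration."""

    def execute_script(self, script):
        # A real driver would run the script in the browser; the stub
        # just returns what the example page would produce.
        assert script == "return window.blog.data;"
        return {"activity": {"type": "read"}}


data = read_js_value(StubDriver(), "window.blog.data")
print(data["activity"]["type"])
```

With the driver from the code above, the call would be read_js_value(driver, "window.blog.data"), made after driver.get(url) and before driver.quit().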


Upvotes: 1

jfs

Reputation: 414675

BeautifulSoup is an html parser; you also need a javascript parser here. btw, some javascript object literals are not valid json (though in your example the literal is also a valid json object).
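
To illustrate that caveat, here is a small sketch (both sample literals are made up): the first string is a JavaScript object literal that is also valid JSON, while the second is legal JavaScript but is rejected by the json module because of its unquoted keys and single-quoted string.

```python
import json

valid_json = '{"activity": {"type": "read"}}'
js_only = "{activity: {type: 'read'}}"  # fine in JavaScript, not JSON


def is_json(text):
    """Return True if `text` parses as JSON."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


print(is_json(valid_json), is_json(js_only))
```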

In simple cases you could:

  1. extract <script>'s text using an html parser
  2. assume that window.blog... is a single line or there is no ';' inside the object and extract the javascript object literal using simple string manipulations or a regex
  3. assume that the string is a valid json and parse it using json module

Example:

#!/usr/bin/env python
html = """<!doctype html>
<title>extract javascript object as json</title>
<script>
// ..
window.blog.data = {"activity":{"type":"read"}};
// ..
</script>
<p>some other html here
"""
import json
import re
from bs4 import BeautifulSoup  # $ pip install beautifulsoup4
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
script = soup.find('script', string=re.compile(r'window\.blog\.data'))
json_text = re.search(r'^\s*window\.blog\.data\s*=\s*({.*?})\s*;\s*$',
                      script.string, flags=re.DOTALL | re.MULTILINE).group(1)
data = json.loads(json_text)
assert data['activity']['type'] == 'read'

If the assumptions are incorrect then the code fails.
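
For comparison, the same three steps can be sketched with only the standard library, under the same assumptions (the assignment sits on a single line and the literal is valid JSON). The class name ScriptCollector, the function find_assigned_json and the sample page are illustrative, not part of any library.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def find_assigned_json(html, name):
    """Return the JSON value assigned to `name` on a single script line."""
    collector = ScriptCollector()
    collector.feed(html)   # step 1: extract the text of each <script>
    collector.close()
    for script in collector.scripts:
        for line in script.splitlines():
            left, sep, right = line.partition("=")
            if sep and left.strip() == name:
                # step 2: drop the trailing ';'; step 3: parse as JSON
                return json.loads(right.strip().rstrip(";"))
    raise ValueError("assignment to %s not found" % name)


sample_html = """<!doctype html>
<script>
// ..
window.blog.data = {"activity":{"type":"read"}};
// ..
</script>
<p>some other html here
"""
data = find_assigned_json(sample_html, "window.blog.data")
print(data["activity"]["type"])
```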

To relax the second assumption, a javascript parser could be used instead of a regex e.g., slimit (suggested by @approximatenumber):

from slimit import ast  # $ pip install slimit
from slimit.parser import Parser as JavascriptParser
from slimit.visitors import nodevisitor

soup = BeautifulSoup(html, 'html.parser')
tree = JavascriptParser().parse(soup.script.string)
obj = next(node.right for node in nodevisitor.visit(tree)
           if (isinstance(node, ast.Assign) and
               node.left.to_ecma() == 'window.blog.data'))
# HACK: easy way to parse the javascript object literal
data = json.loads(obj.to_ecma())  # NOTE: json format may be slightly different
assert data['activity']['type'] == 'read'

There is no need to treat the object literal (obj) as a json object. To get the necessary info, obj can be visited recursively like any other ast node. That would make it possible to support arbitrary javascript code (as long as slimit can parse it).

Upvotes: 16

Christian Thieme

Reputation: 1124

Something like this may work:

import re

HTML = """ 
<html>
    <head>
    ...
    <script type= "text/javascript"> 
window.blog.data = {"activity":
    {"type":"read"}
    };
    ...
    </script> 
    </head>
    <body>
    ...
    </body>
    </html>
"""

# escape the dots so they match literally; DOTALL lets .*? span several lines
JSON = re.compile(r'window\.blog\.data = ({.*?});', re.DOTALL)

matches = JSON.search(HTML)

print(matches.group(1))

Upvotes: 7
