bart

Reputation: 1048

Python - How to extract data from inside a variable within a script?

I'm new to Python and I'm trying to use BeautifulSoup to extract some data from a variable defined in a script.

data = soup.find_all('script', type='text/javascript')
print(data[0])

<script type="text/javascript">
  var myvar = {
    productid: "101",
    productname: "Abc",
  };
</script>

Do you know an easy way to extract the 'productid' and 'productname' from the myvar variable?

Upvotes: 1

Views: 1888

Answers (3)

ewwink

Reputation: 19184

For a simple approach, I would use a regular expression:

import re

.....
data = soup.find_all('script', type='text/javascript')
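# .string returns the script body; .text can come back empty for <script> tags in some bs4 versions
productid = re.search(r'productid:\s*"(.*?)"', data[0].string).group(1)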
print(productid)
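
If you need both fields, the same idea can be extended to collect every key/value pair at once. This is only a minimal sketch: it works on the script body copied from the question, `PAIR_RE` and `fields` are names made up for illustration, and it assumes unquoted keys and double-quoted values with no escaped quotes inside them.

```python
import re

# Script body as it would come back from BeautifulSoup (copied from the question).
script_text = """
  var myvar = {
    productid: "101",
    productname: "Abc",
  };
"""

# Match `key: "value"` pairs. Assumes identifier-style keys and
# double-quoted values that contain no escaped quotes.
PAIR_RE = re.compile(r'(\w+)\s*:\s*"(.*?)"')

# findall returns (key, value) tuples, which dict() turns into a mapping.
fields = dict(PAIR_RE.findall(script_text))
print(fields["productid"], fields["productname"])
```

Like any regex approach, this will also pick up pairs from other object literals in the same script, so it is only suitable when the script is as simple as the one shown.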

Upvotes: 0

Tomalak

Reputation: 338406

There are two ways: easy and wrong, or not quite as easy but correct.

I'm not going to recommend the easy way to you. The correct way is to use a JavaScript parser. For modern JavaScript, esprima is a good choice. There is an interactive online demo, and it's also available as a Python module.

import esprima

# script body as extracted from beautifulsoup
script_text = """
  var myvar = {
    productid: "101",
    productname: "Abc",
  };
"""

tokens = esprima.tokenize(script_text)

In this simple script there is not a lot going on. The list of raw tokens would be enough to get to the values you want. It looks like this:

[
    {
        "type": "Keyword",
        "value": "var"
    },
    {
        "type": "Identifier",
        "value": "myvar"
    },
    {
        "type": "Punctuator",
        "value": "="
    },
    {
        "type": "Punctuator",
        "value": "{"
    },
    {
        "type": "Identifier",
        "value": "productid"
    },
    {
        "type": "Punctuator",
        "value": ":"
    },
    {
        "type": "String",
        "value": "\"101\""
    },
    {
        "type": "Punctuator",
        "value": ","
    },
    {
        "type": "Identifier",
        "value": "productname"
    },
    {
        "type": "Punctuator",
        "value": ":"
    },
    {
        "type": "String",
        "value": "\"Abc\""
    },
    {
        "type": "Punctuator",
        "value": ","
    },
    {
        "type": "Punctuator",
        "value": "}"
    },
    {
        "type": "Punctuator",
        "value": ";"
    }
]

Iterate the list and pick the values you need.

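token_iterator = iter(tokens)

for token in token_iterator:
    # esprima tokens expose their fields as attributes (token.type, token.value)
    if token.type == "Identifier" and token.value == "productname":
        # skip the ":" punctuator; the token after it holds the associated value
        next(token_iterator)
        value_token = next(token_iterator)
        # the raw token value still includes the surrounding quotes
        productname = value_token.value.strip('"')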

For more complex situations, parsing the script into a tree and walking the tree might become necessary.

tree = esprima.parse(script_text)

The tree is more complex (you can view it on the interactive page), but in exchange it carries all the context information that is missing from the plain token list. You would then use the visitor pattern to walk this tree to a specific place. The Python package has an example of how to use the visitor pattern if you're interested.

Upvotes: 1

brokenfoot

Reputation: 11669

Parse the HTML:

from bs4 import BeautifulSoup

script_data='''
<script type="text/javascript">
  var myvar = {
    productid: "101",
    productname: "Abc",
  };
</script>
'''
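# name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(script_data, 'html.parser')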

soup.script.string holds the contents of the script tag as a string. You can call split() on it to get the data by position:

soup.script.string.split()
Output:
['var',
 'myvar',
 '=',
 '{',
 'productid:',
 '"101",',
 'productname:',
 '"Abc",',
 '};']

product_id:

soup.script.string.split()[5].split('"')[1]
Output:
'101'

product_name:

soup.script.string.split()[7].split('"')[1]
Output:
'Abc'
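
The hard-coded indexes 5 and 7 break as soon as another property is added before them. A slightly less brittle variation, sketched here on the script body alone, is to look for the tokens that end in a colon and take whatever follows each one. The names `parts` and `fields` are just for illustration, and it still assumes the values contain no whitespace.

```python
# Script body, i.e. what soup.script.string holds for the HTML above.
script_text = """
  var myvar = {
    productid: "101",
    productname: "Abc",
  };
"""

parts = script_text.split()

fields = {}
# Pair every token with the one after it; a token ending in ":" is a key.
for key, value in zip(parts, parts[1:]):
    if key.endswith(":"):
        # Drop the trailing ":" from the key and the quotes/comma from the value.
        fields[key[:-1]] = value.strip('",')

print(fields)
```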

Upvotes: 0
