Reputation: 1048
I'm new to Python and I'm trying to use BeautifulSoup to extract some data from a variable defined in a script.
data = soup.find_all('script', type='text/javascript')
print(data[0])
<script type="text/javascript">
var myvar = {
productid: "101",
productname: "Abc",
};
</script>
Do you know an easy way to extract the 'productid' and 'productname' from the myvar variable?
Upvotes: 1
Views: 1888
Reputation: 19184
A simple way is to use a regular expression:
import re
.....
data = soup.find_all('script', type='text/javascript')
productid = re.search(r'productid:\s*"(.*?)"', data[0].text).group(1)
print(productid)
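To pull out both fields the same way, the pattern can be built per key. This is only a sketch: the helper name extract_fields is made up for illustration, and script_text stands in for the script body taken from BeautifulSoup. It assumes every value is a double-quoted string on the same line as its key.

```python
import re

# Stand-in for the script body extracted with BeautifulSoup (assumption).
script_text = '''
var myvar = {
productid: "101",
productname: "Abc",
};
'''


def extract_fields(text, names):
    """Return {name: value} for each key followed by a double-quoted string."""
    found = {}
    for name in names:
        pattern = r'\b' + re.escape(name) + r'\s*:\s*"(.*?)"'
        match = re.search(pattern, text)
        if match:  # re.search returns None when the key is absent
            found[name] = match.group(1)
    return found


print(extract_fields(script_text, ["productid", "productname"]))
# {'productid': '101', 'productname': 'Abc'}
```

Checking the match before calling group(1) avoids an AttributeError when a key is missing from the script.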
Upvotes: 0
Reputation: 338406
There are two ways: easy and wrong, or not quite as easy but correct.
I'm not going to recommend the easy way to you. The correct way is to use a JavaScript parser. For modern JavaScript, esprima is a good choice. There is an interactive online demo and it's also available as a Python module.
import esprima
# script body as extracted from beautifulsoup
script_text = """
var myvar = {
productid: "101",
productname: "Abc",
};
"""
tokens = esprima.tokenize(script_text)
In this simple script there is not a lot going on. The list of raw tokens would be enough to get to the values you want. It looks like this:
[
{
"type": "Keyword",
"value": "var"
},
{
"type": "Identifier",
"value": "myvar"
},
{
"type": "Punctuator",
"value": "="
},
{
"type": "Punctuator",
"value": "{"
},
{
"type": "Identifier",
"value": "productid"
},
{
"type": "Punctuator",
"value": ":"
},
{
"type": "String",
"value": "\"101\""
},
{
"type": "Punctuator",
"value": ","
},
{
"type": "Identifier",
"value": "productname"
},
{
"type": "Punctuator",
"value": ":"
},
{
"type": "String",
"value": "\"Abc\""
},
{
"type": "Punctuator",
"value": ","
},
{
"type": "Punctuator",
"value": "}"
},
{
"type": "Punctuator",
"value": ";"
}
]
Iterate the list and pick the values you need.
token_iterator = iter(tokens)
for token in token_iterator:
    # the Python port of esprima exposes token fields as attributes
    if token.type == "Identifier" and token.value == "productname":
        # the token after the next must be the one that holds the associated value
        next(token_iterator)  # skip the ":" punctuator
        value_token = next(token_iterator)
        productname = value_token.value  # still includes the quotes: '"Abc"'
        break
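Note that a String token keeps its surrounding quotes, as the dump above shows. As a rough sketch, the same scan can collect every key and strip the quotes in one pass. It assumes the tokens have first been reduced to plain (type, value) tuples; the helper name pairs_from_tokens and the sample list are made up for illustration, and escape sequences inside the strings are not decoded.

```python
def pairs_from_tokens(tokens):
    """Collect key/value pairs from an Identifier, ":", String token run.

    tokens is a list of (type, value) tuples, as in the dump above.
    """
    result = {}
    triples = zip(tokens, tokens[1:], tokens[2:])
    for (t1, v1), (t2, v2), (t3, v3) in triples:
        if t1 == "Identifier" and (t2, v2) == ("Punctuator", ":") and t3 == "String":
            result[v1] = v3[1:-1]  # drop the surrounding quote characters
    return result


# Hand-written sample mirroring the token list shown above.
sample_tokens = [
    ("Keyword", "var"), ("Identifier", "myvar"), ("Punctuator", "="),
    ("Punctuator", "{"),
    ("Identifier", "productid"), ("Punctuator", ":"), ("String", '"101"'),
    ("Punctuator", ","),
    ("Identifier", "productname"), ("Punctuator", ":"), ("String", '"Abc"'),
    ("Punctuator", ","),
    ("Punctuator", "}"), ("Punctuator", ";"),
]

print(pairs_from_tokens(sample_tokens))
# {'productid': '101', 'productname': 'Abc'}
```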
For more complex situations, parsing the script into a tree and walking the tree might become necessary.
tree = esprima.parse(script_text)
The tree is more complex (you can view it on the interactive page), but in exchange it carries all the context information that is missing from the plain token list. You would then use the visitor pattern to walk this tree to a specific place. The Python package has an example of how to use the visitor pattern if you're interested.
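As a rough illustration of walking such a tree, the sketch below uses a plain recursive search rather than the package's visitor class. It works on a standard ESTree-shaped dictionary; the assumption is that the parsed tree can be converted to one (for example with a toDict() call on the result of esprima.parse). The helper names and the hand-written sample_tree are made up for illustration.

```python
def iter_nodes(node):
    """Yield every dict node of an ESTree-shaped structure, depth first."""
    if isinstance(node, dict):
        yield node
        for child in node.values():
            yield from iter_nodes(child)
    elif isinstance(node, list):
        for child in node:
            yield from iter_nodes(child)


def object_literal_of(tree, var_name):
    """Return the literal properties of `var <var_name> = {...}` or None."""
    for node in iter_nodes(tree):
        if node.get("type") != "VariableDeclarator":
            continue
        ident = node.get("id") or {}
        init = node.get("init") or {}
        if ident.get("name") != var_name or init.get("type") != "ObjectExpression":
            continue
        result = {}
        for prop in init.get("properties", []):
            key = prop.get("key") or {}
            value = prop.get("value") or {}
            # keys are Identifier nodes (name) or Literal nodes (value)
            name = key.get("name", key.get("value"))
            if value.get("type") == "Literal":
                result[name] = value.get("value")
        return result
    return None


# Hand-written tree for the script in the question (trimmed to the essentials).
sample_tree = {
    "type": "Program",
    "body": [{
        "type": "VariableDeclaration",
        "kind": "var",
        "declarations": [{
            "type": "VariableDeclarator",
            "id": {"type": "Identifier", "name": "myvar"},
            "init": {
                "type": "ObjectExpression",
                "properties": [
                    {"type": "Property",
                     "key": {"type": "Identifier", "name": "productid"},
                     "value": {"type": "Literal", "value": "101"}},
                    {"type": "Property",
                     "key": {"type": "Identifier", "name": "productname"},
                     "value": {"type": "Literal", "value": "Abc"}},
                ],
            },
        }],
    }],
}

print(object_literal_of(sample_tree, "myvar"))
# {'productid': '101', 'productname': 'Abc'}
```

Unlike the token scan, this only picks up properties that belong to the named variable, so an unrelated productname elsewhere in the script is ignored.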
Upvotes: 1
Reputation: 11669
Parse
from bs4 import BeautifulSoup
script_data='''
<script type="text/javascript">
var myvar = {
productid: "101",
productname: "Abc",
};
</script>
'''
soup = BeautifulSoup(script_data, 'html.parser')
soup.script.string holds the contents of the script tag as a string. You can call split on that string to get the data by position:
soup.script.string.split()
Output:
['var',
'myvar',
'=',
'{',
'productid:',
'"101",',
'productname:',
'"Abc",',
'};']
product_id:
soup.script.string.split()[5].split('"')[1]
Output:
'101'
product_name:
soup.script.string.split()[7].split('"')[1]
Output:
'Abc'
Upvotes: 0