Reputation: 153
In one of my script I use urllib2
and BeautifulSoup
to parse a HTML page and read a <script>
tag.
This is what I get :
<script>
var x_data = {
logged: logged,
lengthcarrousel: 2,
products : [
{
"serial" : "106541823"
...
</script>
My goal is to read the JSON in the x_data
variable and I do not know how to do it properly.
I though of :
I don't know if these are efficient and if there is some other ways to do it in a nice way.
Do you think a method is preferable to the other ? any method I may not be aware of ?
Thank you in advance for any advice.
EDIT :
Following advice I get the Regexp solution but I can't search in multiple lines despite using re.MULTILINE :
string1 = '<script>
var x_data = {
logged: logged,
lengthcarrousel: 2,
products : [
{
"serial" : "106541823"}
]
};
</script>'
p = re.compile(r'\{.*\};',re.MULTILINE);
m = p.search(string1)
if m:
print m.group(0)
else:
print "Error !"
I always got an "Error !".
EDIT2 :
Works well with re.DOTALL
.
Upvotes: 1
Views: 1104
Reputation: 95402
If it always looks exactly like this, then you can hack a solution like the one you proposed, based on it looking exactly like this.
Because programmers do everything in code, I suspect in practice it will not alway look exactly this, and then any hacky solution will be fragile and will fail at unexpected (read "impossibly inconvenient") moments. (Regex is known to be hacky when it comes to parsing code).
If you want to do this right, you will need to get a real JavaScript parser, apply it to the code fragment defined by the script tag content, to produce an AST, then search the AST for JavaScript nested structures that happen to look like JSON, and take the content of that tree, prettyprinted.
Even this will be fragile in the face of a programmer who assembles the JSON fragment using JavaScript assignment statements. You can handle this by computing data flow, and discovering sets of code that happen to assemble JSON code. This is rather a lot of work.
So you get to decide what the limits on your solution will be, and then accept the consequences when somebody you don't control does something random.
Upvotes: 2
Reputation: 3836
I think these methods are essentially the same in terms of elegance and performance (using {.*}
may be slightly better because .*
is greedy, i.e. there will be almost no backtracking, and because it seems to me more "forgiving" for different JS code formatting nuances). What you may be more interested in is this: https://docs.python.org/3.6/library/json.html.
Upvotes: 2