Reputation: 617
I've used BeautifulSoup to get the snippet below from an HTML page. I'm having trouble extracting just the JSON (the object assigned to FB_DATA). I'm guessing I need to use re.search, but I'm having trouble with the regex.
The snippet is:
<script type="text/javascript">
var FB_DATA = {
"foo": bar,
"two": {
"foo": bar,
}};
var FB_PUSH = [];
var FB_PULL = [];
</script>
Upvotes: 1
Views: 2193
Reputation: 20486
I'm assuming your main issue is using a .*? when . matches anything but newlines. Using the s (dot-matches-newline) modifier, you can accomplish this very simply:
(?s) (?# dot-match-all modifier)
var (?# match var literally)
\s+ (?# match 1+ whitespace)
FB_DATA (?# match FB_DATA literally)
\s* (?# match 0+ whitespace)
= (?# match = literally)
\s* (?# match 0+ whitespace)
( (?# start capture group)
\{ (?# match { literally)
.*? (?# lazily match 0+ characters)
\} (?# match } literally)
) (?# end capture group)
; (?# match ; literally)
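If you'd rather keep the pattern readable inside Python, here is a minimal sketch of the same expression compiled with re.VERBOSE and re.DOTALL (the flag equivalent of the (?s) modifier). The name FB_DATA_VERBOSE and the sample string are made up for illustration; the sample uses numeric values so the object is self-contained.

```python
import re

# Same pattern as above, written for re.VERBOSE: unescaped whitespace in the
# pattern is ignored and "#" starts a comment that runs to the end of the line.
FB_DATA_VERBOSE = re.compile(r"""
    var \s+ FB_DATA   # the declaration keyword and the variable name
    \s* = \s*         # the assignment, with optional whitespace around it
    ( \{ .*? \} )     # capture group 1: lazy match from the opening brace
    ;                 # ends at the first closing brace followed by a semicolon
""", re.VERBOSE | re.DOTALL)

# Hypothetical input, shaped like the script body in the question.
snippet = 'var FB_DATA = {\n"foo": 1,\n"two": {"foo": 2}};\nvar FB_PUSH = [];'

captured = FB_DATA_VERBOSE.search(snippet).group(1)
print(captured)
```

Because the match is lazy, it stops at the first } that is immediately followed by ;, which is why the nested object is still captured whole here.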
Your JSON string will be in capture group #1.
import re
# html is the page source (or the script tag's text) as a string
m = re.search(r"(?s)var\s+FB_DATA\s*=\s*(\{.*?\});", html)
print(m.group(1))
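Once you have the captured text, you can hand it to json.loads. A rough sketch, assuming the real page contains valid JSON (the snippet in the question, with its unquoted bar and trailing comma, would not parse as-is). SAMPLE_HTML and extract_fb_data are hypothetical names used only for this example.

```python
import json
import re

# Hypothetical page content; unlike the question's snippet, the values are
# quoted and there is no trailing comma, so the object is valid JSON.
SAMPLE_HTML = """<script type="text/javascript">
var FB_DATA = {
"foo": "bar",
"two": {
"foo": "bar"
}};
var FB_PUSH = [];
var FB_PULL = [];
</script>"""

# re.DOTALL is the flag form of the (?s) modifier: "." also matches newlines.
FB_DATA_RE = re.compile(r"var\s+FB_DATA\s*=\s*(\{.*?\});", re.DOTALL)


def extract_fb_data(html):
    """Return the FB_DATA object as a dict, or None if the pattern is absent."""
    m = FB_DATA_RE.search(html)
    if m is None:
        return None
    return json.loads(m.group(1))


data = extract_fb_data(SAMPLE_HTML)
print(data)
```

Checking for None avoids an AttributeError when the script tag is missing. Keep in mind the lazy match ends at the first }; in the text, so a string value that itself contains }; would cut the capture short and json.loads would raise.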
Upvotes: 6
Reputation: 7562
Start with
FB_DATA = (\{[^;]*;)
and see in which cases it's not enough.
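To see where that simpler pattern holds up and where it falls short, here is a small sketch with two made-up inputs. Note that [^;] already matches newlines, so no dot-all flag is needed, and that the capture group includes the trailing semicolon.

```python
import re

# The simpler pattern: everything from the opening brace up to and including
# the first semicolon.
PATTERN = re.compile(r"FB_DATA = (\{[^;]*;)")

# Hypothetical inputs: one plain object, one with a semicolon inside a value.
simple = 'var FB_DATA = {"foo": "bar"};\nvar FB_PUSH = [];'
tricky = 'var FB_DATA = {"foo": "a;b"};\nvar FB_PUSH = [];'

simple_match = PATTERN.search(simple).group(1)
tricky_match = PATTERN.search(tricky).group(1)

print(simple_match)  # the whole object, plus the trailing semicolon
print(tricky_match)  # cut off at the semicolon inside the string value
```

So the capture needs its final ; stripped before parsing, and any semicolon inside the data is one of the cases where the pattern is not enough.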
Upvotes: 0