Balakrishnan

Reputation: 2441

Extract JavaScript from an HTML document using a regular expression

I am trying to extract the JavaScript from google.com using a regular expression.

Program

import urllib
import re
gdoc = urllib.urlopen('http://google.com').read()
scriptlis = re.findall(r'<script>(.*?)</script>', gdoc)
print scriptlis

Output:

['']

Can anyone tell me how to extract JavaScript from an HTML document using only a regular expression?

Upvotes: 2

Views: 2945

Answers (4)

user2555451

This works:

import urllib
import re
gdoc = urllib.urlopen('http://google.com').read()
scriptlis = re.findall('(?si)<script>(.*?)</script>', gdoc)
print scriptlis

The key here is (?si). The "s" sets the "dotall" flag (same as re.DOTALL), which makes . match newlines as well. That was the root of your problem: the scripts on google.com span multiple lines, so the regex can't match them unless (.*?) is allowed to include newlines.

The "i" sets the "ignorecase" flag (same as re.IGNORECASE), which makes the match case-insensitive, so the tag name matches in any capitalization. This isn't strictly necessary here because Google's markup is consistent. But if you had pages with tags written like <SCRIPT>...</SCRIPT>, you would need this flag.
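
To see what each flag contributes without hitting the network, here is a minimal sketch. The html string is a made-up stand-in for the downloaded page, not Google's actual markup.

```python
import re

# Hypothetical sample page (an assumption, used instead of fetching google.com).
html = """<html><head>
<script>
var a = 1;
</script>
<SCRIPT>var b = 2;</SCRIPT>
</head></html>"""

# No flags: "." stops at newlines and the tag name is case-sensitive,
# so neither script block is found.
plain = re.findall(r'<script>(.*?)</script>', html)

# "s" only: the multi-line block matches, the uppercase tag still does not.
dotall = re.findall(r'(?s)<script>(.*?)</script>', html)

# "s" and "i" together: both blocks match.
both = re.findall(r'(?si)<script>(.*?)</script>', html)

print(plain)
print(dotall)
print(both)
```

The inline (?si) form and the flags argument (re.S | re.I) are interchangeable here; the inline form just keeps the flags next to the pattern.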

Upvotes: 5

Burhan Khalid

Reputation: 174622

If you don't mind third-party libraries, requests and BeautifulSoup work well together:

import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://www.google.com')
p = bs(r.content, 'html.parser')  # name the parser explicitly
p.find_all('script')

Upvotes: 1

You could try this:

scriptlis = re.findall(r'<script\s*([^>]*)\s*>(.*?)</script', gdoc, re.I|re.S)

That's because most script tags look like this:

<script language="javascript" src="foo"></script>

or

<script language="javascript">alert("foo")</script>

and some are even written in uppercase: <SCRIPT></SCRIPT>

None of these match your regex. My regex captures the attributes in group 1 and any inline code in group 2. Note that it also matches script tags inside HTML comments. Still, it is about the best you can do without BeautifulSoup or a similar parser.
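
As a rough illustration of the two groups, the sketch below runs that pattern over the tag variants listed above. The html string, including the commented-out tag, is invented sample input.

```python
import re

# Hypothetical sample input covering the tag variants discussed above.
html = """<script language="javascript" src="foo"></script>
<script language="javascript">alert("foo")</script>
<SCRIPT></SCRIPT>
<!-- <script>alert("commented out")</script> -->"""

pattern = re.compile(r'<script\s*([^>]*)\s*>(.*?)</script', re.I | re.S)

# With two groups, findall returns (attributes, inline code) tuples.
matches = pattern.findall(html)
for attrs, code in matches:
    print(repr(attrs), repr(code))
```

The last tuple comes from the tag inside the HTML comment, which shows the limitation mentioned above: the regex has no notion of comments.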

Upvotes: 0

PepperoniPizza

Reputation: 9112

I think the problem is that the text between <script> and </script> spans several lines, so you could try something like this:

# Non-greedy (.*?) so each script block is matched separately
rg = re.compile('<script>(.*?)</script>', re.DOTALL)
result = rg.findall(gdoc)
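
One thing to watch once re.DOTALL is on is the quantifier: a greedy (.*) can run from the first <script> to the last </script>, while the non-greedy (.*?) stops at the nearest closing tag. A small sketch on a made-up two-script string:

```python
import re

# Hypothetical input with two separate script blocks.
html = "<script>var a = 1;</script>\n<p>text</p>\n<script>var b = 2;</script>"

# Greedy: a single match that swallows everything between the
# first opening tag and the last closing tag.
greedy = re.findall(r'<script>(.*)</script>', html, re.DOTALL)

# Non-greedy: one match per script block.
lazy = re.findall(r'<script>(.*?)</script>', html, re.DOTALL)

print(len(greedy), greedy)
print(len(lazy), lazy)
```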

Upvotes: 0
