Reputation: 2441
I am trying to extract JavaScript from google.com using a regular expression.
Program:
import urllib
import re
gdoc = urllib.urlopen('http://google.com').read()
scriptlis = re.findall(r'<script>(.*?)</script>', gdoc)
print scriptlis
Output:
['']
Can anyone tell me how to extract JavaScript from an HTML document using only regular expressions?
Upvotes: 2
Views: 2945
Reputation:
This works:
import urllib
import re
gdoc = urllib.urlopen('http://google.com').read()
scriptlis = re.findall('(?si)<script>(.*?)</script>', gdoc)
print scriptlis
The key here is (?si). The "s" sets the "dotall" flag (same as re.DOTALL), which makes the dot match newlines as well. That was the root of your problem: the scripts on google.com span multiple lines, so the regex can't match them unless you let (.*?) include newlines.
The "i" sets the "ignorecase" flag (same as re.IGNORECASE), which makes the match case-insensitive. This isn't strictly necessary here because Google's markup is fairly consistent. But if the page contained something like <SCRIPT>...</SCRIPT>, you would need this flag.
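To see the effect of each flag in isolation, here is a minimal sketch in Python 3 syntax. It runs against a made-up sample string (sample_html is hypothetical, not the real google.com markup), so no network access is needed:

```python
import re

# Hypothetical sample page: the first script body spans several lines,
# and the second tag is written in upper case.
sample_html = """<html><head>
<script>
var a = 1;
</script>
<SCRIPT>var b = 2;</SCRIPT>
</head></html>"""

pattern = r'<script>(.*?)</script>'

# No flags: "." stops at newlines and the match is case-sensitive.
no_flags = re.findall(pattern, sample_html)

# DOTALL only: the multi-line script matches, the upper-case one does not.
dotall_only = re.findall(pattern, sample_html, re.DOTALL)

# (?si) inline flags: DOTALL plus IGNORECASE, so both scripts match.
both_flags = re.findall('(?si)' + pattern, sample_html)

print(no_flags)
print(dotall_only)
print(both_flags)
```

With this sample, the flagless pattern finds nothing, DOTALL alone finds only the lowercase multi-line script, and (?si) finds both.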
Upvotes: 5
Reputation: 174622
If you don't have an issue with third-party libraries, requests combined with BeautifulSoup makes for a great combination:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://www.google.com')
p = bs(r.content, 'html.parser')
p.find_all('script')
Upvotes: 1
Reputation: 133929
What you could probably try is:
scriptlis = re.findall(r'<script\s*([^>]*)\s*>(.*?)</script', gdoc, re.I|re.S)
Because most script tags are of type:
<script language="javascript" src="foo"></script>
or
<script language="javascript">alert("foo")</script>
and some are even <SCRIPT></SCRIPT>
None of these match your regex. My regex captures the attributes in group 1 and any inline code in group 2. Note that it will also match script tags that sit inside HTML comments. But it is about the best you can do without BeautifulSoup et al.
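As a rough illustration of what the two groups contain, here is a sketch in Python 3 syntax. The sample_html string is invented for the example; it reuses the tag shapes mentioned above plus one script inside an HTML comment:

```python
import re

# Hypothetical input covering the tag shapes discussed above,
# plus a script tag that sits inside an HTML comment.
sample_html = '''<script language="javascript" src="foo"></script>
<script language="javascript">alert("foo")</script>
<SCRIPT></SCRIPT>
<!-- <script>commented()</script> -->'''

pattern = re.compile(r'<script\s*([^>]*)\s*>(.*?)</script', re.I | re.S)

# With two groups, findall returns a list of (attributes, code) tuples.
matches = pattern.findall(sample_html)

for attrs, code in matches:
    print(repr(attrs), repr(code))
```

The last tuple comes from the commented-out tag, which shows the limitation: a regex has no notion of HTML comments, so it cannot skip them.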
Upvotes: 0
Reputation: 9112
I think the problem is that the text between <script> and </script> spans several lines, so you could try something like this:
rg = re.compile('<script>(.*?)</script>', re.DOTALL)
result = rg.findall(gdoc)
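One detail worth checking once re.DOTALL is on is greediness: a greedy (.*) can run from the first <script> all the way to the last </script>, while a lazy (.*?) stops at the nearest closing tag. A small sketch in Python 3 syntax, using a made-up sample_html string with two scripts:

```python
import re

# Hypothetical page with two separate multi-line scripts.
sample_html = (
    "<script>\nfirst();\n</script>\n"
    "<p>text</p>\n"
    "<script>\nsecond();\n</script>"
)

# Greedy: one big match that swallows the markup between the scripts.
greedy = re.findall(r'<script>(.*)</script>', sample_html, re.DOTALL)

# Lazy: one match per script element.
lazy = re.findall(r'<script>(.*?)</script>', sample_html, re.DOTALL)

print(len(greedy), greedy)
print(len(lazy), lazy)
```

On a page with a single script element both forms give the same result; the difference only shows up when there are two or more.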
Upvotes: 0