Reputation: 351
In a given .html page, I have a script tag like so:
<script>jQuery(window).load(function () {
setTimeout(function(){
jQuery("input[name=Email]").val("[email protected]");
}, 1000);
});</script>
How can I use Beautiful Soup to extract the email address?
Upvotes: 23
Views: 52159
Reputation: 861
To get the string inside the <script> tag, you can use .contents or .string.
from bs4 import BeautifulSoup
data = """
<body>
<script>jQuery(window).load(function () {
setTimeout(function(){
jQuery("input[name=Email]").val("[email protected]");
}, 1000);
});</script>
</body>
"""
soup = BeautifulSoup(data, "html.parser")
script = soup.find("script")
inner_text_with_string = script.string
inner_text_with_content = script.contents[0]
print('inner_text_with_string', inner_text_with_string)
print('inner_text_with_content', inner_text_with_content)
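The two attributes differ once a script tag has no inline body. The following is a minimal sketch of that difference, assuming hypothetical markup with one inline script and one external script (the variable names and the `app.js` path are illustrative, not from the question):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one inline script and one external (empty) script.
html = """
<script>var a = 1;</script>
<script src="app.js"></script>
"""

soup = BeautifulSoup(html, "html.parser")
inline, external = soup.find_all("script")

inline_string = inline.string          # the tag's single text child
inline_first = inline.contents[0]      # the same node, reached through the child list
external_string = external.string      # None, because the tag has no children
external_contents = external.contents  # empty list, so indexing [0] would raise IndexError
```

So `.string` is the safer choice when some script tags may be empty, as long as the result is checked for `None` before use.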
Upvotes: 2
Reputation: 10528
You could solve this with just a couple of lines of gazpacho and .split, no regex required!
from gazpacho import Soup
html = """\
<script>jQuery(window).load(function () {
setTimeout(function(){
jQuery("input[name=Email]").val("[email protected]");
}, 1000);
});</script>
"""
soup = Soup(html)
string = soup.find("script").text
string.split(".val(\"")[-1].split("\");")[0]
Which would output:
'[email protected]'
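One caveat with chained `.split` calls is that they return some string even when the `.val("` marker is missing, so a failed extraction goes unnoticed. Below is a rough sketch of the same idea in plain Python using `str.partition`, which makes the missing-marker case explicit. The helper name `extract_val_argument` and the address `user@example.com` are placeholders for illustration, and unlike the `[-1]` indexing above, this version takes the first `.val("` call rather than the last:

```python
def extract_val_argument(script_text):
    """Return the first double-quoted argument passed to .val(), or None."""
    _, marker, rest = script_text.partition('.val("')
    if not marker:
        return None  # the script contains no .val(" call
    value, closing, _ = rest.partition('");')
    return value if closing else None  # None if the call is never closed


# Placeholder script text standing in for the contents of the script tag.
script_text = 'jQuery("input[name=Email]").val("user@example.com");'
result = extract_val_argument(script_text)
missing = extract_val_argument("console.log('nothing here');")
```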
Upvotes: 1
Reputation: 13585
I ran into a similar problem, and the issue seems to be that calling script_tag.text returns an empty string. Instead, you have to call script_tag.string. Maybe this changed in some version of BeautifulSoup?
Anyway, @alecxe's answer didn't work for me, so I modified their solution:
import re
from bs4 import BeautifulSoup
data = """
<body>
<script>jQuery(window).load(function () {
setTimeout(function(){
jQuery("input[name=Email]").val("[email protected]");
}, 1000);
});</script>
</body>
"""
soup = BeautifulSoup(data, "html.parser")
script_tag = soup.find("script")
if script_tag:
    # contains all of the script tag, e.g. "jQuery(window)..."
    script_tag_contents = script_tag.string
    # from there you can search the string using a regex, etc.
    email = re.search(r'\.+val\("(.+)"\);', script_tag_contents).group(1)
    print(email)
This prints [email protected].
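Two failure modes are worth guarding against in code like this: `.string` is `None` for a script tag without a single text child, and `re.search` returns `None` when nothing matches, so calling `.group(1)` directly raises an exception. Here is a small sketch of a guarded version. The helper name `find_val_argument`, the slightly tightened pattern, and the address `user@example.com` are assumptions for illustration:

```python
import re

# One literal dot and a non-greedy group, so the match stops at the first '");'.
VAL_CALL = re.compile(r'\.val\("(.+?)"\);')


def find_val_argument(script_text):
    """Return the argument of the first .val("...") call, or None."""
    if script_text is None:
        return None  # e.g. script_tag.string for an empty or external script
    match = VAL_CALL.search(script_text)
    return match.group(1) if match else None


# Placeholder script text standing in for script_tag.string.
found = find_val_argument('jQuery("input[name=Email]").val("user@example.com");')
not_found = find_val_argument("alert(1);")
```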
Upvotes: 16
Reputation: 473763
To add a bit more to @Bob's answer: suppose you also need to locate the right script tag in HTML that may contain other script tags.
The idea is to define a regular expression that is used both for locating the element with BeautifulSoup and for extracting the email value:
import re
from bs4 import BeautifulSoup
data = """
<body>
<script>jQuery(window).load(function () {
setTimeout(function(){
jQuery("input[name=Email]").val("[email protected]");
}, 1000);
});</script>
</body>
"""
pattern = re.compile(r'\.val\("([^@]+@[^@]+\.[^@]+)"\);', re.MULTILINE | re.DOTALL)
soup = BeautifulSoup(data, "html.parser")
script = soup.find("script", text=pattern)
if script:
    match = pattern.search(script.text)
    if match:
        email = match.group(1)
        print(email)
Prints: [email protected].
Here we use a simple regular expression for the email address. It could be made stricter, but I doubt that is practically necessary for this problem.
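As a sketch of how this extends to a page with several script tags, the variant below collects every match with `find_all`. It uses `string=`, the name Beautiful Soup has used for the `text=` argument since 4.4.0, and reads `script.string` rather than `script.text`. The sample markup, the address `user@example.com`, and the extra `"` exclusion in the character classes are assumptions for illustration:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page with an unrelated script next to the one of interest.
html = """
<body>
<script>console.log("unrelated");</script>
<script>jQuery("input[name=Email]").val("user@example.com");</script>
</body>
"""

# Excluding '"' keeps the match inside a single quoted argument.
pattern = re.compile(r'\.val\("([^@"]+@[^@"]+\.[^@"]+)"\);')

soup = BeautifulSoup(html, "html.parser")
emails = []
# string=pattern keeps only the script tags whose text matches the pattern.
for script in soup.find_all("script", string=pattern):
    match = pattern.search(script.string)
    if match:
        emails.append(match.group(1))
```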
Upvotes: 23
Reputation: 6173
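This is not possible using only BeautifulSoup, but you can do it with, for example, BS + regular expressions:
import re
from bs4 import BeautifulSoup as BS
html = """<script> ... </script>"""
bs = BS(html, "html.parser")
# .string holds the script body; another answer reports .text coming back empty
txt = bs.script.string
# re.search scans the whole script; re.match only matches at the start,
# and '.' does not cross newlines
email = re.search(r'\.val\("(.+?)"\);', txt).group(1)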
or like this:
...
email = txt.split('.val("')[1].split('");')[0]
Upvotes: 3