Reputation: 85
I'm working on "Video Downloader" and I have one problem with BeautifulSoup4.
Here is part of html which from I want to get a href:
<script src="/static/common.js?v7"></script>
<script type="text/javascript">
var c = 6;
window.onload = function() {
count();
}
function closeAd(){
$("#easy-box").hide();
}
function notLogedIn(){
$("#not-loged-in").html("You need to be logged in to download this movie!");
}
function count() {
if(document.getElementById('countdown') != null){
c -= 1;
//If the counter is within range we put the seconds remaining to the <span> below
if (c >= 0)
if(c == 0){
document.getElementById('countdown').innerHTML = '';
}
else {
document.getElementById('countdown').innerHTML = c;
}
else {
document.getElementById('download-link').innerHTML = '<a style="text-decoration:none;" href="http://s896.vshare.io/download,9999999999999999999999999999999999999999-f6192405453bf5ff3cfe41a488d8390d,5944ed28,4d948c5.avi">Click here</a> to download requested file.';
return;
}
//setTimeout('count()', 1000);
}
}
</script>
<script type="text/javascript" src="/static/flowplayer/flowplayer-3.2.13.min.js"></script>
And here is href which I want to print:
href="http://s896.vshare.io/download,9999999999999999999999999999999999999999-f6192405453bf5ff3cfe41a488d8390d,5944ed28,4d948c5.avi"
I was trying with this, but it's not working.
for a in soup3.find_all('a'):
if 'href' in a.attrs:
print(a['href'])
Upvotes: 1
Views: 2153
Reputation: 510
Beautiful Soup can parse HTML and XML, not JavaScript.
You can use regular expression to search this code.
Using <a [^>]*?(href=\"([^\">]+)\")
you can match everything inside this code which:
<a
- is an a
tag[^>]*?
- can have any characters that are not >
href="
- have href[^\">]+
- have any number of characters other than "
and >
To extract script code from html you can use
script = soup.find('script', {'type': 'text/javascript'})
and then to parse it, use
re.search(r"<a [^>]*?(href=\"([^\">]+)\")", script.text)
Remember to import re
first.
print(re.search(r"<a [^>]*?(href=\"([^\">]+)\")", script.text)[1])
# href="http://s896.vshare.io/download,9999999999999999999999999999999999999999-f6192405453bf5ff3cfe41a488d8390d,5944ed28,4d948c5.avi
print(re.search(r"<a [^>]*?(href=\"([^\">]+)\")", script.text)[2])
# http://s896.vshare.io/download,9999999999999999999999999999999999999999-f6192405453bf5ff3cfe41a488d8390d,5944ed28,4d948c5.avi
Read about regular expression. If you are going to use pattern often, compile it first.
https://docs.python.org/3/library/re.html
Upvotes: 2