jestembotem

Reputation: 85

Python(BeautifulSoup) - Get href from <script>

I'm working on a "Video Downloader" and I have a problem with BeautifulSoup4.

Here is the part of the HTML from which I want to get an href:

<script src="/static/common.js?v7"></script>
<script type="text/javascript">
            var c = 6;
            window.onload = function() {
                count();
            }

            function closeAd(){
                $("#easy-box").hide();
            }

            function notLogedIn(){
                $("#not-loged-in").html("You need to be logged in to download this movie!");
            }

            function count() {
                if(document.getElementById('countdown') != null){
                    c -= 1;
                    //If the counter is within range we put the seconds remaining to the <span> below 
                    if (c >= 0) 
                        if(c == 0){
                            document.getElementById('countdown').innerHTML = '';
                        }
                        else {
                            document.getElementById('countdown').innerHTML = c; 
                        }
                    else {
                        document.getElementById('download-link').innerHTML = '<a style="text-decoration:none;" href="http://s896.vshare.io/download,9999999999999999999999999999999999999999-f6192405453bf5ff3cfe41a488d8390d,5944ed28,4d948c5.avi">Click here</a> to download requested file.';
                        return;
                    }           
                    //setTimeout('count()', 1000);
                }
            }
        </script>
<script type="text/javascript" src="/static/flowplayer/flowplayer-3.2.13.min.js"></script>

And here is the href that I want to print:

href="http://s896.vshare.io/download,9999999999999999999999999999999999999999-f6192405453bf5ff3cfe41a488d8390d,5944ed28,4d948c5.avi"

I tried this, but it prints nothing:

for a in soup3.find_all('a'):
    if 'href' in a.attrs:
        print(a['href'])

Upvotes: 1

Views: 2153

Answers (1)

Szymon

Reputation: 510

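Beautiful Soup can parse HTML and XML, not JavaScript. The <a> tag you want exists only as text inside a JavaScript string, so find_all('a') never sees it as an element. You can use a regular expression to search the script code instead.
The pattern <a [^>]*?(href=\"([^\">]+)\") matches, in order: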

  • <a - the start of an a tag
  • [^>]*? - any characters other than >, as few as possible
  • href=" - the start of the href attribute
  • [^\">]+ - one or more characters other than " and > (the URL itself)
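]+ - have any">

The breakdown above can be checked on a shortened line of JavaScript. This is only a sketch: the sample string and the example.com URL are made up for illustration and are not taken from the page in the question.

```python
import re

# Hypothetical, shortened stand-in for the JavaScript line in the question.
js_line = "el.innerHTML = '<a style=\"text-decoration:none;\" href=\"http://example.com/file.avi\">Click here</a>';"

match = re.search(r"<a [^>]*?(href=\"([^\">]+)\")", js_line)

attribute = match.group(1)  # whole attribute, including both quotes
url = match.group(2)        # only the URL between the quotes

print(attribute)
print(url)
```

Group 1 is the outer pair of parentheses in the pattern and group 2 is the inner pair, which is why the second one holds the bare URL.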

To extract the script code from the HTML, you can use
script = soup.find('script', {'type': 'text/javascript'})
which returns the first script tag with that type attribute. Then, to search it, use
re.search(r"<a [^>]*?(href=\"([^\">]+)\")", script.text)
Remember to import re first.

print(re.search(r"<a [^>]*?(href=\"([^\">]+)\")", script.text)[1])
# href="http://s896.vshare.io/download,9999999999999999999999999999999999999999-f6192405453bf5ff3cfe41a488d8390d,5944ed28,4d948c5.avi"
print(re.search(r"<a [^>]*?(href=\"([^\">]+)\")", script.text)[2])
# http://s896.vshare.io/download,9999999999999999999999999999999999999999-f6192405453bf5ff3cfe41a488d8390d,5944ed28,4d948c5.avi

Read up on regular expressions. If you are going to use the pattern often, compile it first with re.compile.
https://docs.python.org/3/library/re.html
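
A minimal sketch of the compiled variant, assuming you want None instead of an exception when a script contains no link. The names HREF_PATTERN and first_href, and the sample strings, are made up for this example.

```python
import re

# Compiled once and reused; same pattern as in the answer above.
HREF_PATTERN = re.compile(r"<a [^>]*?(href=\"([^\">]+)\")")


def first_href(script_source):
    """Return the first href URL found in script_source, or None."""
    match = HREF_PATTERN.search(script_source)
    return match.group(2) if match else None


# Hypothetical inputs: one script body with a link, one without.
with_link = "x.innerHTML = '<a href=\"http://example.com/a.avi\">Click here</a>';"
without_link = "var c = 6;"

print(first_href(with_link))
print(first_href(without_link))
```

Checking the match before reading a group matters because re.search returns None when nothing matches, and indexing None raises a TypeError.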

Upvotes: 2
