wan mohd payed

Reputation: 171

How to get download links in Python using BeautifulSoup?

I want to get the download links from a page, for example http://www.brothersoft.com/windows/top-downloads/

The expected result should be:

List of url:
 1. http://www.brothersoft.com/photoscape-64604.html
 2. http://www.brothersoft.com/orbit-downloader-54366.html
 3. ....
 4. ...
 till 100.

I have tried this code:

 import urllib
 from bs4 import BeautifulSoup

 pageFile = urllib.urlopen("http://www.brothersoft.com/windows/top-downloads/")

 pageHtml = pageFile.read()

 pageFile.close()

 soup = BeautifulSoup("".join(pageHtml))

 sAll = soup.findAll("a")

 for i in range (0,100)
    for link in sAll:
      print i,link

But it gives incorrect output. Thanks.

Upvotes: 0

Views: 287

Answers (1)

user2629998


First of all, BeautifulSoup("".join(pageHtml)) is not needed: pageHtml is already a string, so you can pass it directly, as in BeautifulSoup(pageHtml).

for i in range (0,100)

If you're using Python 2 (which I think you are, since in Python 3 urlopen lives in urllib.request rather than in urllib itself), you should use xrange(100) instead. It does not build a 100-element list up front, so it is slightly cheaper. You also don't need to pass the starting zero when counting from zero, so xrange(100) will do just fine.
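For readers on Python 3, here is a minimal sketch of the equivalents. It only shows where urlopen lives and how range behaves; it does not fetch anything, and the variable names are just for illustration.

```python
# Python 3: urlopen moved to urllib.request, and range() is lazy like xrange().
from urllib.request import urlopen  # imported only to show its Python 3 location

counter = range(100)            # yields the same values as range(0, 100)
first_five = list(counter)[:5]  # materialize just to inspect the start

print(first_five)    # [0, 1, 2, 3, 4]
print(len(counter))  # 100
```

In Python 3 there is no xrange at all; plain range already avoids building the list.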

Also, you have a syntax error: the for line is missing the : at the end. The space between range and (0,100) is legal but unconventional, so remove that too.

Finally, your code prints every link 100 times, which is not what you need. If you only need the first 100 links, use something like this:

 for i in xrange(100):
     print sAll[i]["href"]

This iterates over the numbers 0 to 99, with the current value in i, and uses i as an index into the sAll list. It then prints the "href" attribute of that item, which is the link target. Note that this raises an IndexError if sAll holds fewer than 100 items, and a KeyError if one of the a tags has no href attribute.

If you also want to print the number, use print i, sAll[i]["href"] instead.
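Indexing by position fails in two ways: going past the end of the list raises IndexError, and an a tag without an href raises KeyError. The following is a minimal sketch in Python 3 that sidesteps both by filtering and slicing. SAMPLE_HTML and first_links are hypothetical names made up for this example, and the inline HTML stands in for the downloaded page so the code runs offline; the real page's markup may well differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<ul>
  <li><a href="/photoscape-64604.html">PhotoScape</a></li>
  <li><a name="anchor-without-href">skip me</a></li>
  <li><a href="/orbit-downloader-54366.html">Orbit Downloader</a></li>
</ul>
"""


def first_links(html, limit=100):
    """Return up to `limit` href values from the <a> tags in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only anchors that have an href attribute (no KeyError);
    # slicing never reads past the end of the list (no IndexError).
    anchors = soup.find_all("a", href=True)[:limit]
    return [anchor["href"] for anchor in anchors]


links = first_links(SAMPLE_HTML)
for number, href in enumerate(links, 1):
    print(number, href)
```

On the real page this will still return every link, including navigation links, so you would probably need to narrow the search to the element that wraps the download list.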

Upvotes: 1
