I want to get the download links from a page, for example http://www.brothersoft.com/windows/top-downloads/
The expected result should be:
List of URLs:
1. http://www.brothersoft.com/photoscape-64604.html
2. http://www.brothersoft.com/orbit-downloader-54366.html
3. ....
4. ...
and so on, up to 100.
I have tried this code:
import urllib
from bs4 import BeautifulSoup
pageFile = urllib.urlopen("http://www.brothersoft.com/windows/top-downloads/")
pageHtml = pageFile.read()
pageFile.close()
soup = BeautifulSoup("".join(pageHtml))
sAll = soup.findAll("a")
for i in range (0,100)
    for link in sAll:
        print i,link
But it gives incorrect output. Thanks.
First of all, BeautifulSoup("".join(pageHtml)) is not needed: pageHtml is already a string, so you can pass it directly, as in BeautifulSoup(pageHtml).
for i in range (0,100)
If you're using Python 2 (which I think you are, since in Python 3 urlopen lives in urllib.request rather than directly in urllib), you should use xrange(100) instead: it yields the numbers lazily rather than building a 100-element list first. The leading zero is also unnecessary when counting from zero, so xrange(100) will do just fine.
Also, you have a syntax error: the line is missing the : at the end. The space between range and (0,100) is legal but unconventional, so it is better to remove it.
Finally, your code will just print all the links 100 times, which is not what you need; if you only need the first 100 links you should use something like this:
for i in xrange(100):
    print sAll[i]["href"]
This iterates over the numbers 0 to 99, with the current value in i, and uses i as an index into the sAll list. It then prints the "href" attribute of that item, which is the link target. Note that this raises an IndexError if sAll contains fewer than 100 items, and a KeyError if one of the a tags has no href attribute.
If you also want to print the number, use print i, sAll[i]["href"] instead.
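As a rough sketch of a more defensive version of the loop above: slicing instead of indexing avoids the IndexError on short pages, and skipping anchors without an href avoids the KeyError. The example is written for Python 3 and uses the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without extra packages; with BeautifulSoup, something like soup.find_all("a", href=True)[:100] should give a similar result. The names HrefCollector, first_hrefs and SAMPLE_HTML are made up for illustration, and the sample markup is a hypothetical substitute for the downloaded page.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag that has one (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip anchors with a missing or empty href
            self.hrefs.append(href)


def first_hrefs(html, limit=100):
    """Return at most `limit` link targets, in document order."""
    collector = HrefCollector()
    collector.feed(html)
    # Slicing never raises, even if fewer than `limit` links were found.
    return collector.hrefs[:limit]


# Assumed sample markup standing in for the real page content.
SAMPLE_HTML = """
<a name="top">no href here</a>
<a href="/photoscape-64604.html">PhotoScape</a>
<a href="/orbit-downloader-54366.html">Orbit Downloader</a>
"""

# enumerate(..., start=1) numbers the links from 1, as in the expected output.
for number, href in enumerate(first_hrefs(SAMPLE_HTML, limit=100), start=1):
    print(number, href)
```

Note that this collects every link on the page, not only the download entries; narrowing it down to the top-downloads list would depend on the page's actual markup, which is not shown here.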