Reputation: 13
I'm having trouble linking the pages together. I need a spider that follows the links on each page and grabs the required details. So far my code is able to grab the required information, but there are other pages too, and I need their information as well. The base_url contains the application listings; I want to collect all the links from that page, then switch to the next page and repeat, and finally visit each collected link to extract the details of each application (name, version number, etc.).
Right now I am able to collect all the information, but the pages are not linked together. How can I do that? Here is my code:
# extracting links
def linkextract(soup):
    print "\n extracting links of next pages"
    print "\n\n page 2 \n"
    sAll = [div.find('a') for div in soup.findAll('div', attrs={'class': ''})]
    for i in sAll:
        suburl = "" + i['href']  # checking pages
        print suburl
        pages = mech.open(suburl)
        content = pages.read()
        anosoup = BeautifulSoup(content)
        extract(anosoup)

app_url = ""
print app_url
#print soup.prettify()
page1 = mech.open(app_url)
html1 = page1.read()
soup1 = BeautifulSoup(html1)
print "\n\n application page details \n"
extractinside(soup1)
Assistance required, thank you.
Upvotes: 0
Views: 147
Reputation: 474161
Here's what you should start with:
import urllib2
from bs4 import BeautifulSoup

URL = 'http://www.pcwelt.de/download-neuzugaenge.html'

soup = BeautifulSoup(urllib2.urlopen(URL))
links = [tr.td.a['href']
         for tr in soup.find('div', {'class': 'boxed'}).table.find_all('tr')
         if tr.td]

for link in links:
    url = "http://www.pcwelt.de{0}".format(link)
    soup = BeautifulSoup(urllib2.urlopen(url))
    name = soup.find('span', {'itemprop': 'name'}).text
    version = soup.find('td', {'itemprop': 'softwareVersion'}).text
    print "Name: %s; Version: %s" % (name, version)
prints:
Name: Ashampoo Clip Finder HD Free; Version: 2.3.6
Name: Many Cam; Version: 4.0.63
Name: Roboform; Version: 7.9.5.7
...
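The snippet above handles a single listing page. To walk through the subsequent pages as well (which the question asks for), you can keep following the "next page" link until there is none. Here is a minimal, self-contained sketch of that loop, written in Python 3 syntax with the standard-library html.parser and a hypothetical PAGES dict standing in for real HTTP fetches:

```python
from html.parser import HTMLParser

class NextLinkParser(HTMLParser):
    """Records the href of the first <a> tag whose text is 'next'."""
    def __init__(self):
        super().__init__()
        self.in_a = False
        self.href = None
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_a = True
            self.href = dict(attrs).get('href')

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_a = False

    def handle_data(self, data):
        if self.in_a and data.strip().lower() == 'next' and self.next_url is None:
            self.next_url = self.href

# Hypothetical pages: each one links to the next until the chain ends.
PAGES = {
    '/page1': '<a href="/page2">next</a>',
    '/page2': '<a href="/page3">next</a>',
    '/page3': '<p>no more pages</p>',
}

def crawl(start):
    """Follow 'next' links from start, returning every page visited."""
    visited = []
    url = start
    while url is not None and url not in visited:  # stop at the end or on a cycle
        visited.append(url)
        parser = NextLinkParser()
        parser.feed(PAGES[url])  # a real spider would feed the downloaded HTML here
        url = parser.next_url
    return visited

print(crawl('/page1'))
```

In the real spider, `parser.feed(PAGES[url])` would be replaced by feeding the HTML downloaded from `url`, and the per-application extraction (name, version, and so on) would happen inside the loop for each page visited.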
Hope that helps.
Upvotes: 2