w.c

Reputation: 15

Having problems following links with webcrawler

I am trying to create a webcrawler that parses all the html on a page, grabs a specified link (via raw_input), follows that link, and then repeats the process a specified number of times (again via raw_input). I am able to grab the first link and print it successfully. However, I am having problems "looping" the whole process, and I usually grab the wrong link. This is the first link:

https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html

(Full disclosure, this question pertains to an assignment for a Coursera course)

Here's my code:

import urllib
from BeautifulSoup import *
url = raw_input('Enter - ')
rpt=raw_input('Enter Position')
rpt=int(rpt)
cnt=raw_input('Enter Count')
cnt=int(cnt)
count=0
counts=0
tags=list()
soup=None
while x==0:
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
# Retrieve all of the anchor tags
    tags=soup.findAll('a')
    for tag in tags:
        url= tag.get('href')
        count=count + 1
        if count== rpt:
            break
counts=counts + 1
if counts==cnt:        
    x==1       
else: continue
print  url

Upvotes: 1

Views: 1247

Answers (4)

Martien Lubberink

Reputation: 2735

Here are my 2 cents:

import urllib
#import ssl
from bs4 import BeautifulSoup
#'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
url = raw_input('Enter URL : ')
position = int(raw_input('Enter position : '))
count = int(raw_input('Enter count : '))

print('Retrieving: ' + url)
soup = BeautifulSoup(urllib.urlopen(url).read())

for x in range(1, count + 1):
    link = list()
    for tag in soup('a'):
        link.append(tag.get('href', None))    
    print('Retrieving: ' + link[position - 1])
    soup = BeautifulSoup(urllib.urlopen(link[position - 1]).read())

Upvotes: 0

S.Martinin

Reputation: 11

I also worked on that course, and with help from a friend I got this worked out:

import urllib
from bs4 import BeautifulSoup

url = "http://python-data.dr-chuck.net/known_by_Happy.html"
rpt = 7
position = 18

count = 0
while True:
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.findAll('a')
    url = tags[position - 1].get('href')
    count = count + 1
    if count == rpt:
        break

print url

Upvotes: 1

lutz-the-lion

Reputation: 11

Based on DJanssens' response, I found the solution:

url = tags[position-1].get('href')

did the trick for me!

Thanks for the assistance!
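As a quick illustration of why that line grabs the right link (a sketch only: the list below is a hypothetical stand-in for the anchors `soup.findAll('a')` returns, not the real page):

```python
# Hypothetical hrefs standing in for tag.get('href') over findAll('a');
# the names are illustrative, not taken from the real page.
hrefs = ['known_by_A.html', 'known_by_B.html', 'known_by_C.html']

position = 2               # 1-based, as entered at the raw_input prompt
url = hrefs[position - 1]  # indexing picks exactly the requested link
# url == 'known_by_B.html'
```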

Upvotes: 1

DJanssens

Reputation: 20799

I believe this is what you are looking for:

import urllib
from bs4 import BeautifulSoup

url = raw_input('Enter - ')
position = int(raw_input('Enter Position'))
count = int(raw_input('Enter Count'))

# perform the loop "count" times.
for _ in xrange(count):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    tags = soup.findAll('a')
    # if the link does not exist at that position, show an error and stop.
    if position > len(tags):
        print "A link does not exist at that position."
        break
    # overwrite url with the link at that position so the next search will use it.
    url = tags[position - 1].get('href')
print url

The code now loops the number of times specified in the input; on each pass it takes the href at the given position and replaces the current url with it, so every iteration follows the chain of links one step further.

I advise you to use full names for variables, which are a lot easier to understand. In addition, you can read and cast them in a single line, which makes the beginning of your script easier to follow.
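To illustrate that advice (a sketch only: `fake_input` is a hypothetical stand-in for `raw_input`, used here so the snippet runs non-interactively):

```python
# Hypothetical stand-in for raw_input so the example runs without a terminal;
# in the real script you would write: position = int(raw_input('Enter position: '))
answers = iter(['18', '7'])

def fake_input(prompt):
    return next(answers)

# read and cast in a single line, with descriptive names
position = int(fake_input('Enter position: '))
count = int(fake_input('Enter count: '))
# position == 18, count == 7
```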

Upvotes: 0
