Reputation: 21
I have been struggling to write this web crawler for a few days and I can't get it to work. I've searched for similar questions and solutions but found nothing, so please point me to another question if this has already been asked.
My web crawler is supposed to find n URLs the first website links to, then find x URLs that each of those n URLs links to, and so on until a certain depth is reached, with a certain number of URLs per level. For example: I enter a URL to crawl, find 3 linked URLs, then 3 URLs linked from each of those 3 URLs, and so forth, giving 1 + 3^1 + 3^2 + 3^3 + ... URLs. So far I've written this, but I can't get it to work the way I want:
from bs4 import BeautifulSoup
import requests

url = 'http://www.baidu.com'
depth = 3  # 3 levels
count = 3  # amount of urls in each level

def extractURL(url, depth, count):
    list = []
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    tags = soup.find_all('a')
    newtags = tags[:depth]
    for link in newtags:
        url2 = link.get('href')
        if url2 is not None and url2.startswith("http"):
            list.append(url2)
    for url3 in list:
        if(count > 0):
            if not url3 is None and "http" in url:
                print(url, "->", url3)
                count = count - 1
                print("----------------")  # divider for each url and connecting urls..?
                extractURL(url3, depth, count)

extractURL(url, depth, count)
print("Done")
The point is for it to print "url -> (linking to) url2". I think my counter isn't working because it never resets, but I have no clue how to solve this. Thanks in advance!
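To show what I mean about the counter, here is a stripped-down sketch of my recursion using fake link lists instead of real HTTP requests (the fake links are purely illustrative): since count is decremented and then passed into the recursive call, every deeper call starts with a smaller budget instead of a fresh one.

# stripped-down sketch of the counting problem; fake_links stands in
# for the real pages (hypothetical data, no network access needed)
def extract(url, depth, count):
    if depth == 0:
        return
    fake_links = [url + c for c in "abc"]  # pretend each page has 3 links
    for child in fake_links:
        if count > 0:
            print(url, "->", child, "(count =", count, ")")
            count = count - 1
            extract(child, depth - 1, count)  # count is already reduced here

extract("/", 2, 3)
# the recursion into the first child starts with count = 2, into the
# next with count = 1, and so on: the budget shrinks instead of
# resetting to 3 at each level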
Upvotes: 2
Views: 1287
Reputation: 96
You can use this code to extract links correctly. You must keep each layer of links separate in order to avoid analyzing duplicate links:
from bs4 import BeautifulSoup
import requests

url = 'http://www.baidu.com'
depth = 3  # 3 levels
count = 3  # amount of urls in each level

url_list_depth = [[] for i in range(0, depth + 1)]  # one list of links per level
url_list_depth[0].append(url)

for depth_i in range(0, depth):
    for links in url_list_depth[depth_i]:
        response = requests.get(links)
        soup = BeautifulSoup(response.text, 'html.parser')
        tags = soup.find_all('a')
        for link in tags:
            url_new = link.get('href')
            # check whether this link was already collected at any level
            flag = False
            for item in url_list_depth:
                for l in item:
                    if url_new == l:
                        flag = True
            if url_new is not None and "http" in url_new and flag is False:
                url_list_depth[depth_i + 1].append(url_new)
                print(links, "->", url_new)

print('Done')
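As a side note, the nested loops above rescan every collected link for each new href, and the count limit from the question is not enforced. Here is a minimal variant sketch, assuming the same requests/BeautifulSoup stack (the crawl helper name is made up for illustration), that uses a set for O(1) duplicate checks and keeps only count links per page:

from bs4 import BeautifulSoup
import requests

def crawl(start_url, depth=3, count=3):
    visited = {start_url}            # a set makes duplicate checks O(1)
    current_level = [start_url]
    for _ in range(depth):
        next_level = []
        for page in current_level:
            try:
                response = requests.get(page, timeout=5)
            except requests.RequestException:
                continue             # skip pages that fail to load
            soup = BeautifulSoup(response.text, 'html.parser')
            found = 0
            for tag in soup.find_all('a'):
                href = tag.get('href')
                if href is None or not href.startswith('http') or href in visited:
                    continue
                visited.add(href)
                next_level.append(href)
                print(page, "->", href)
                found += 1
                if found >= count:   # keep only `count` links per page
                    break
        current_level = next_level

crawl('http://www.baidu.com')
print('Done')

With depth = 3 and count = 3 this visits at most 1 + 3 + 9 + 27 pages, matching the 1 + 3^1 + 3^2 + 3^3 pattern from the question.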
Upvotes: 3