Reputation: 117
I am using Python and Beautiful Soup to loop through a bunch of websites that I have stored in a stack and extract the visible text from them (which I do through a defined function). Each time my loop reaches this website: https://blogs.oracle.com/cloud-infrastructure/use-the-nuage-networks-virtualized-services-platform-with-oracle-cloud,
it hangs on response = http.request('GET', scan_url)
without producing any error. Nothing happens no matter how long I wait. I would consider just skipping this website, but since I am using this to learn, I am trying to see whether I can fix the issue, and I have found no solutions. It may simply be a gap in my understanding of what makes a website scrapable or not.
Here is the code snippet that is relevant to this:
for elems in stack:
    print("Looking at Site")
    print(elems)
    print("Here")
    http = urllib3.PoolManager()
    scan_url = elems
    print("Here2")
    try:
        response = http.request('GET', scan_url)
    except:
        confirmed_stack.append("Cannot Scrape: " + elems)
        continue
    print("Here3")
    soup = BeautifulSoup(response.data, 'html.parser')
    print("Yer")
    texts = soup.find_all(text=True)
    visible_texts = filter(tag_visible, texts)
    # print("HTML")
    pause_rand2 = random.randint(1, 2)
    time.sleep(pause_rand2)
I have imported the following libraries in Python, although not all of them are relevant or used, since I was playing around with them:
from googlesearch import search
from bs4 import BeautifulSoup
from bs4.element import Comment
import random
import urllib3
import math
import re
from collections import Counter
import os
import time
from translate import Translator
import requests
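One detail that may explain the hang: urllib3 does not set a read timeout by default, so a server that accepts the connection but never sends a response will block `http.request` indefinitely. A minimal sketch of bounding the request with an explicit timeout (the URL here is just an illustration, not the site from the question):

```python
import urllib3

http = urllib3.PoolManager()

try:
    # timeout bounds both the connect and read phases, so a server that
    # silently drops the request makes the call raise instead of hang;
    # retries=2 caps how many times urllib3 re-attempts the request.
    response = http.request(
        'GET', 'https://example.com',
        timeout=urllib3.Timeout(connect=5.0, read=10.0),
        retries=2,
    )
    print(response.status)
except urllib3.exceptions.MaxRetryError as exc:
    print('Cannot scrape:', exc)
```

With this in place, an unresponsive site falls into your existing `except` branch instead of freezing the loop.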
Upvotes: 0
Views: 284
Reputation: 28595
Try including a user agent in the headers:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'} #<-- add this line
for elems in stack:
    print("Looking at Site")
    print(elems)
    print("Here")
    http = urllib3.PoolManager()
    scan_url = elems
    print("Here2")
    try:
        response = requests.get(scan_url, headers=headers) #<-- adjusted this line
    except:
        confirmed_stack.append("Cannot Scrape: " + elems)
        continue
    print("Here3")
    soup = BeautifulSoup(response.text, 'html.parser') #<-- minor change here
    print("Yer")
    texts = soup.find_all(text=True)
    visible_texts = filter(tag_visible, texts)
    # print("HTML")
    pause_rand2 = random.randint(1, 2)
    time.sleep(pause_rand2)
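If you would rather stay with urllib3 than switch to requests, the same fix applies there: headers passed to the PoolManager are sent with every request it makes. A sketch under that assumption (the URL is a placeholder, not the Oracle blog from the question):

```python
import urllib3

# Same browser User-Agent as above; without it, some servers stall
# or reject requests that identify as a script.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/80.0.3987.149 Safari/537.36'
}

# Headers set here are applied to every request from this manager.
http = urllib3.PoolManager(headers=headers)

try:
    response = http.request(
        'GET', 'https://example.com',
        timeout=urllib3.Timeout(connect=5.0, read=10.0),
        retries=2,
    )
    # response.data is bytes; BeautifulSoup accepts it directly.
    print(response.status)
except urllib3.exceptions.MaxRetryError as exc:
    print('Cannot scrape:', exc)
```

Either way, creating the PoolManager once outside the loop (rather than once per site, as in the question) lets urllib3 reuse connections.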
Upvotes: 1