Reputation: 117
I am using Python and Beautiful Soup to loop through a bunch of websites that I have stored in a stack and extract the visible text from them (which I do through a defined function). Each time my loop reaches this website: https://blogs.oracle.com/cloud-infrastructure/use-the-nuage-networks-virtualized-services-platform-with-oracle-cloud,
it hangs on response = http.request('GET', scan_url)
without producing any error. Nothing happens no matter how long I wait. I would consider just skipping this website, but since I am using this to learn, I am trying to see whether I can fix the issue, and I have found no solutions. It may simply be a gap in my understanding of what makes a website scrapable or not.
Here is the code snippet that is relevant to this:
for elems in stack:
    print("Looking at Site")
    print(elems)
    print("Here")
    http = urllib3.PoolManager()
    scan_url = elems
    print("Here2")
    try:
        response = http.request('GET', scan_url)
    except:
        confirmed_stack.append("Cannot Scrape: " + elems)
        continue
    print("Here3")
    soup = BeautifulSoup(response.data, 'html.parser')
    print("Yer")
    texts = soup.find_all(text=True)
    visible_texts = filter(tag_visible, texts)
    # print("HTML")
    pause_rand2 = random.randint(1, 2)
    time.sleep(pause_rand2)
I have imported the following libraries in Python, although not all of them are relevant or used, since I was playing around with them:
from googlesearch import search
from bs4 import BeautifulSoup
from bs4.element import Comment
import random
import urllib3
import math
import re
from collections import Counter
import os
import time
from translate import Translator
import requests
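One detail that may explain the hang: urllib3 does not set a read timeout by default, so a server that accepts the connection but never sends a response will block `http.request` indefinitely. A minimal sketch of bounding the request with an explicit timeout (the URL here is just an illustration, not the site from the question):

```python
import urllib3

http = urllib3.PoolManager()

try:
    # timeout bounds both the connect and read phases, so a server that
    # silently drops the request makes the call raise instead of hang;
    # retries=2 caps how many times urllib3 re-attempts the request.
    response = http.request(
        'GET', 'https://example.com',
        timeout=urllib3.Timeout(connect=5.0, read=10.0),
        retries=2,
    )
    print(response.status)
except urllib3.exceptions.MaxRetryError as exc:
    print('Cannot scrape:', exc)
```

With this in place, an unresponsive site falls into your existing `except` branch instead of freezing the loop.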
Upvotes: 0
Views: 284
Reputation: 28595
Try including a user agent in the headers:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'} #<-- add this line
for elems in stack:
    print("Looking at Site")
    print(elems)
    print("Here")
    http = urllib3.PoolManager()
    scan_url = elems
    print("Here2")
    try:
        response = requests.get(scan_url, headers=headers) #<-- adjusted this line
    except:
        confirmed_stack.append("Cannot Scrape: " + elems)
        continue
    print("Here3")
    soup = BeautifulSoup(response.text, 'html.parser') #<-- minor change here
    print("Yer")
    texts = soup.find_all(text=True)
    visible_texts = filter(tag_visible, texts)
    # print("HTML")
    pause_rand2 = random.randint(1, 2)
    time.sleep(pause_rand2)
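If you would rather stay with urllib3 than switch to requests, the same fix applies there: headers passed to the PoolManager are sent with every request it makes. A sketch under that assumption (the URL is a placeholder, not the Oracle blog from the question):

```python
import urllib3

# Same browser User-Agent as above; without it, some servers stall
# or reject requests that identify as a script.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/80.0.3987.149 Safari/537.36'
}

# Headers set here are applied to every request from this manager.
http = urllib3.PoolManager(headers=headers)

try:
    response = http.request(
        'GET', 'https://example.com',
        timeout=urllib3.Timeout(connect=5.0, read=10.0),
        retries=2,
    )
    # response.data is bytes; BeautifulSoup accepts it directly.
    print(response.status)
except urllib3.exceptions.MaxRetryError as exc:
    print('Cannot scrape:', exc)
```

Either way, creating the PoolManager once outside the loop (rather than once per site, as in the question) lets urllib3 reuse connections.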
Upvotes: 1