Warwick Logan

Reputation: 117

Beautiful Soup Web Scraping Hanging on Website

I am using Python and Beautiful Soup to loop through a number of websites that I have stored in a stack and extract the visible text from each of them (via a function I have defined).

Every time my loop reaches this website: https://blogs.oracle.com/cloud-infrastructure/use-the-nuage-networks-virtualized-services-platform-with-oracle-cloud, it hangs on response = http.request('GET', scan_url) without producing any error. Nothing happens no matter how long I wait. I would consider simply skipping this website, but since I am doing this to learn, I want to see whether the issue can be fixed, and so far I have found no solution. It may simply be a gap in my understanding of what makes a website scrapable or not.

Here is the code snippet that is relevant to this:

for elems in stack:
    print("Looking at Site")
    print(elems)
    print("Here")
    http = urllib3.PoolManager()
    scan_url = elems
    print("Here2")
    try:
        response = http.request('GET', scan_url)
    except:
        confirmed_stack.append("Cannot Scrape: "+elems+"")
        continue
    print("Here3")
    soup = BeautifulSoup(response.data, 'html.parser')
    print("Yer")
    texts = soup.find_all(text=True)
    visible_texts = filter(tag_visible, texts)
    # print("HTML")
    pause_rand2 = random.randint(1, 2)
    time.sleep(pause_rand2)
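One detail worth noting (my own observation, not something stated in the question): urllib3 falls back to the system default socket timeout, which can be unlimited, so a server that accepts the connection but never finishes responding will stall the request forever rather than raising into the except branch. A minimal sketch, with a placeholder URL, of how an explicit timeout and retry budget turn the hang into a catchable exception:

```python
import urllib3

# Sketch (assumed values, not from the original post): an explicit timeout
# and retry limit make a stalled server raise instead of hanging forever.
http = urllib3.PoolManager(
    timeout=urllib3.Timeout(connect=5.0, read=10.0),  # seconds
    retries=urllib3.util.Retry(total=2),
)
try:
    response = http.request('GET', 'https://example.com/')
    print(response.status)
except urllib3.exceptions.HTTPError as err:  # MaxRetryError etc. subclass this
    print('Cannot scrape:', err)
```

With this in place, the bare except in the loop above would actually fire for the problematic site instead of waiting indefinitely.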

I have imported the following libraries in Python, although not all of them are relevant or used, since I was experimenting with them:

from googlesearch import search
from bs4 import BeautifulSoup
from bs4.element import Comment
import random
import urllib3
import math
import re
from collections import Counter
import os
import time
from translate import Translator
import requests

Upvotes: 0

Views: 284

Answers (1)

chitown88

Reputation: 28595

Try including a user agent in the headers:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}  #<-- add this line


for elems in stack:
    print("Looking at Site")
    print(elems)
    print("Here")
    http = urllib3.PoolManager()
    scan_url = elems
    print("Here2")
    try:
        response = requests.get(scan_url, headers=headers) #<-- adjusted this line
    except:
        confirmed_stack.append("Cannot Scrape: "+elems+"")
        continue
    print("Here3")
    soup = BeautifulSoup(response.text, 'html.parser') #<-- minor change here
    print("Yer")
    texts = soup.find_all(text=True)
    visible_texts = filter(tag_visible, texts)
    #print("HTML")
    pause_rand2 = random.randint(1, 2)
    time.sleep(pause_rand2)
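One caveat to add to this answer (my addition, not part of the original): requests also waits indefinitely by default, so if a server accepts the connection but never responds, the loop can still hang even with the right headers. Passing a timeout makes the call raise instead. A sketch with a placeholder URL, reusing the headers from above:

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
try:
    # (connect timeout, read timeout) in seconds -- a stalled server now
    # raises requests.exceptions.Timeout instead of blocking forever.
    response = requests.get('https://example.com/', headers=headers, timeout=(5, 10))
    print(response.status_code)
except requests.exceptions.RequestException as err:
    print('Cannot scrape:', err)
```

requests.exceptions.Timeout is a subclass of RequestException, so a single except clause covers both timeouts and other connection failures.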

Upvotes: 1
