Reputation: 91
I am currently trying to enrich a machine-learning dataset with a script that downloads images from Google.
I first iterate over a dataframe that contains the terms to search on Google, then retrieve the URLs of the images to download with the Selenium webdriver, and save them in specific folders depending on the search term via this function:
def download_image(file_path, url, file_name):
    try:
        response = requests.get(url)
        response.raise_for_status()
        with open(os.path.join(file_path, file_name), 'wb') as file:
            file.write(response.content)
        print(f"Image downloaded successfully to {os.path.join(file_path, file_name)}")
    except requests.exceptions.HTTPError as http_error:
        print(f"HTTP error occurred: {http_error}")
    except Exception as error:
        print(f"An error occurred: {error}")
which is called in this loop:
def enhanced_dataset_folder(name: str, tag: str, df):
    DRIVER_PATH = "chromedriver"
    wd = webdriver.Chrome(DRIVER_PATH)
    urls = get_images(tag, wd, 1, 2)
    folder_name = name.split('/')[0]
    props = tag.split(' ')
    test = []
    for i, url in enumerate(urls):
        try:
            img_name = str(i) + "_img" + str(i) + ".jpg"
            download_image("train/" + folder_name + "/", url, img_name)
        except Exception as e:
            print('Fail: ', e)
            continue
        else:
            print("ok")
            #df.append([folder_name+"/"+img_name,tag,props[0],props[1],props[2]], ignore_index=True)
    wd.quit()
The Google Chrome window and the script always stop at the same time, no matter how many photos I fetch per page. I get this output, but no error is raised:
Image downloaded successfully to train/1982 Porsche 944/0_img0.jpg
ok
Image downloaded successfully to train/1982 Porsche 944/1_img1.jpg
ok
HTTP error occurred: 403 Client Error: Forbidden. Please comply with the User-Agent policy: https://meta.wikimedia.org/wiki/User-Agent_policy for url: https://upload.wikimedia.org/wikipedia/commons/1/13/1986_944_Turbo.jpg
ok
Image downloaded successfully to train/1996 Ferrari 550 Maranello/0_img0.jpg
ok
Image downloaded successfully to train/1996 Ferrari 550 Maranello/1_img1.jpg
ok
Image downloaded successfully to train/1996 Ferrari 550 Maranello/2_img2.jpg
ok
Image downloaded successfully to train/2001 BMW 3 Series Convertible/0_img0.jpg
ok
After that I get nothing, even if I let it run for more than 10 minutes.
I know the problem comes from the download_image function, because when I don't call it the URLs are retrieved for every row of the dataframe.
Upvotes: 2
Views: 248
Reputation: 551
You are calling the HTTP server in synchronous mode, which means that once the socket is connected your script waits until the data is received and the connection is closed, or until you press ^C. This is a trick implemented by the firewall/web server of the service you are trying to use.
You can switch to aiohttp to perform several calls in asynchronous mode. Be careful to adjust your connection rate properly and to introduce proper gaps between your calls.
This answer might help you: aiohttp: rate limiting parallel requests
You can use asyncio.sleep after creating a set of requests, and if they don't finish in the expected time, you can drop the future objects, which effectively means dropping your side of the connection.
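For illustration, here is a minimal sketch of that approach (this is not the asker's code; the concurrency limit, timeout and sleep values are assumptions, and a client-side timeout is used instead of manually dropping the futures):

import asyncio
import os

import aiohttp

MAX_CONCURRENT = 3                          # assumed limit; tune to the site's policy
TIMEOUT = aiohttp.ClientTimeout(total=30)   # give up on a request after 30 seconds

async def fetch_image(session, semaphore, url, path):
    # The semaphore limits how many downloads run in parallel.
    async with semaphore:
        try:
            async with session.get(url) as response:
                response.raise_for_status()
                data = await response.read()
            with open(path, 'wb') as file:
                file.write(data)
            print(f"Image downloaded successfully to {path}")
        except (aiohttp.ClientError, asyncio.TimeoutError) as error:
            print(f"Download failed for {url}: {error}")
        await asyncio.sleep(1)               # small gap between consecutive calls

async def download_all(urls, folder):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    async with aiohttp.ClientSession(timeout=TIMEOUT) as session:
        tasks = [
            fetch_image(session, semaphore, url,
                        os.path.join(folder, f"{i}_img{i}.jpg"))
            for i, url in enumerate(urls)
        ]
        await asyncio.gather(*tasks)

# Usage (urls being the list returned by get_images):
# asyncio.run(download_all(urls, "train/1982 Porsche 944"))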
Upvotes: 2
Reputation: 1211
It was quite helpful to read the error message:
HTTP error occurred: 403 Client Error: Forbidden.
Please comply with the User-Agent policy: https://meta.wikimedia.org/wiki/User-Agent_policy
for url: https://upload.wikimedia.org/wikipedia/commons/1/13/1986_944_Turbo.jpg
You obviously are violating the policies of the website. To protect themselves they can take any countermeasures they like, including sending fake content to you.
In this case (wikimedia.org) they tell you how they will accept you scraping their files: https://meta.wikimedia.org/wiki/User-Agent_policy
They expect a proper user agent that allows them to classify the access and contact you. They urge you to send a proper agent string that identifies you as an individual, identifiable bot; otherwise they take countermeasures.
They expect the word "bot" within the agent string. The expected syntax of the agent string is:
<client name>/<version> (<contact information>) <library/framework name>/<version> [<library name>/<version> ...]
# Example:
User-Agent: CoolBot/0.0 (https://example.org/coolbot/; [email protected]) generic-library/0.0
For Python, they also give a sample code snippet:
import requests
url = 'https://example/...'
headers = {'User-Agent': 'CoolBot/0.0 (https://example.org/coolbot/; [email protected])'}
response = requests.get(url, headers=headers)
So I would suggest to set a proper User-Agent header as described in the policy, identifying your script as a bot and including contact information.
Then give the bot a run and tell us if things got better.
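As a minimal sketch, the asker's download_image could pass such a header like this (the bot name, project URL and e-mail address are placeholders to be replaced with your own details; the added timeout is an assumption, so a stalled connection cannot hang the script indefinitely):

import os

import requests

# Placeholder identity - replace with your own bot name, project URL and e-mail.
HEADERS = {
    'User-Agent': 'CarDatasetBot/0.1 (https://example.org/car-dataset-bot; you@example.org) python-requests/2.28'
}

def download_image(file_path, url, file_name):
    try:
        # Send the identifying User-Agent; the timeout is an added assumption.
        response = requests.get(url, headers=HEADERS, timeout=30)
        response.raise_for_status()
        with open(os.path.join(file_path, file_name), 'wb') as file:
            file.write(response.content)
        print(f"Image downloaded successfully to {os.path.join(file_path, file_name)}")
    except requests.exceptions.HTTPError as http_error:
        print(f"HTTP error occurred: {http_error}")
    except Exception as error:
        print(f"An error occurred: {error}")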
A 403 is a qualified response to a GET or POST request. With this answer the HTTP request is finished, and your script has to decide what to do next.
=> Your script decides to continue after writing to the log.
If you had been blocked generally (i.e. by an access rule for your IP address), you would see a 403 for every single access to this server.
=> That is not the case in this logfile.
'Forbidden' occurs when accessing a restricted resource. As you get your URLs from a Google search, URLs to restricted files are possible, since such URLs might be published in the public area of a website.
=> There is nothing special about a 403 at first glance.
What makes a 403 look like a trigger is the combination of a 403 hit followed by a problem, on a regular basis, at the same site (or sites hosted by the same people).
=> Some more details about these 403s combined with the problem would be nice.
As you write that the problem disappeared: what have you changed?
Or did you just get a new search result from Google prioritizing other sites?
Your statement significantly increases the probability that the 403-causing URL is a trigger URL:
I didn't change anything except bypassing the url causing the first 403 error which led to my script stopping. I didn't find yet the best behavior for this algorithm but this workaround allowed me to enrich my dataset
By doing this you bypassed the trigger.
The best thing for your project to avoid the problem is to gain acceptance by the scraped websites (see above).
When they notice their trigger has been discovered and bypassed, they will choose another URL as trigger and the game restarts. Don't be astonished when your IP (range) or fingerprinted profile gets blacklisted.
The problem does not come from your code but from the bot tool and its settings. Violating the usage policy will cause a reaction; this is a common effect on the internet.
(I'm sure you don't like this answer...)
Upvotes: 5
Reputation: 193308
This error message...
HTTP error occurred: 403 Client Error: Forbidden. Please comply with the User-Agent policy: https://meta.wikimedia.org/wiki/User-Agent_policy for url: https://upload.wikimedia.org/wikipedia/commons/1/13/1986_944_Turbo.jpg
...implies that HTTP 403 Forbidden response status code was encountered while accessing a valid URL.
Possibly it's the same issue as Invalid Status code=403 text=Forbidden, which we have been discussing for quite some time now.
A blanket solution would be to add the argument --remote-allow-origins=* through an instance of Options as follows:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--remote-allow-origins=*")
DRIVER_PATH = "chromedriver"
wd = webdriver.Chrome(executable_path=DRIVER_PATH, options=options)
Upvotes: 0