Usman Rafiq

Reputation: 580

Timeout Error When Downloading Image from Specific URL with Python requests

I've been working on a Python script that downloads images from various URLs and uploads them to AWS S3. My script has been functioning well for multiple domains, but I encounter a timeout error when trying to download an image from a specific URL (https://www.net-a-porter.com/variants/images/17266703523615883/in/w920_a3-4_q60.jpg).

I've attempted to troubleshoot by increasing the timeout and adding headers, yet the problem persists.

import requests
import tempfile
import os

def upload_image_to_s3_from_url(self, image_url, filename, download_timeout=120):
    """
    Downloads an image from the given URL to a temporary file and uploads it to AWS S3,
    then returns the S3 file URL.
    """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
            'Accept': 'image/avif,image/webp,image/apng,image/*,*/*;q=0.8'
        }
        response = requests.get(image_url, timeout=download_timeout, stream=True, headers=headers)
        response.raise_for_status()
        
        with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
            for chunk in response.iter_content(chunk_size=8192):
                tmp_file.write(chunk)
            
            file_url = self.upload_image_to_s3(tmp_file.name, filename)
        
        os.unlink(tmp_file.name)
        return file_url
    except requests.RequestException as e:
        raise Exception(f"Failed to download or upload image. Error: {e}")

Error encountered: Exception: Failed to download or upload image. Error: HTTPSConnectionPool(host='www.net-a-porter.com', port=443): Read timed out. (read timeout=60)

I've tried:

  1. Increasing download_timeout to higher values
  2. Modifying request headers to mimic a real browser session

This approach works for images from other domains, but not for the URL mentioned above.

Any insights or suggestions would be greatly appreciated. Thank you in advance for your help!

Upvotes: 2

Views: 410

Answers (1)

madflow

Reputation: 8520

I suspect that the host has some kind of scraper detection in place and blocks the request based on the User-Agent.

I was able to make a successful request by changing it to:

"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36",
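If you want to verify the header swap in isolation, without the S3 logic, here is a minimal stdlib-only sketch. It uses `urllib.request` instead of `requests` just to keep it dependency-free, and it assumes the host filters purely on the User-Agent, so nothing else in your flow needs to change:

```python
from urllib.request import Request, urlopen

# The Linux/Chrome User-Agent that worked for me (assumption: the host
# blocks based only on this header).
WORKING_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"
)

def build_image_request(image_url: str) -> Request:
    """Build a GET request carrying the working User-Agent."""
    return Request(image_url, headers={
        "User-Agent": WORKING_UA,
        "Accept": "image/avif,image/webp,image/apng,image/*,*/*;q=0.8",
    })

# To actually fetch (network call; the host may still block some IPs):
# with urlopen(build_image_request(image_url), timeout=30) as resp:
#     data = resp.read()
```

In your original code you would only swap the `User-Agent` value in the `headers` dict; the rest of the download/upload logic can stay as it is.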

Upvotes: 1
