EEEEEZINO

Reputation: 63

How can I determine the extension of a downloaded file in Python?

This is my code for downloading JPG images:

from bs4 import BeautifulSoup
import requests
from selenium import webdriver
import urllib.request
import os
import shutil
from mimetypes import guess_extension

img_folder = ("c:/test")
if os.path.exists(img_folder):
    shutil.rmtree(img_folder)

path = (r"C:\Users\qpslt\Desktop\py\chromedriver_win32\chromedriver.exe")
driver = webdriver.Chrome(path)
site_url = ("https://gall.dcinside.com/board/view/?id=baseball_new8&no=10131338&exception_mode=recommend&page=1")
driver.get(site_url)
images = driver.find_elements_by_xpath('//div[@class="writing_view_box"]//img')

for i, img in enumerate(images, 1):
    img_url = img.get_attribute('src')
    print(i, img_url)
    r = requests.get(img_url, headers={'Referer': site_url})
    try:   # create the folder
        if not os.path.exists(img_folder):
            os.makedirs(img_folder)
    except Exception as er:
        print("An error occurred: {}".format(er))
        break
    # save the image; the .jpg extension is hard-coded here
    with open("c:/test/{}.jpg".format(i), 'wb') as f:
        f.write(r.content)

I don't always know the extension of the image in advance.

How can I find out the extension of the file I am downloading?

Upvotes: 5

Views: 2337

Answers (3)

Will Keeling

Reputation: 23004

If the image link has no extension (e.g. if the image is dynamically generated by a PHP script), you can map the Content-Type header of the image response to a file extension using mimetypes.guess_extension().

For example:

import mimetypes

...

r = requests.get(img_url, headers={'Referer': site_url})
extension = mimetypes.guess_extension(r.headers.get('content-type', '').split(';')[0]) 

...

with open("c:/test/{}{}".format(i, extension or '.jpg'), 'wb') as f:
    f.write(r.content)

The example above will try to use the mapped extension when it exists, but will fall back to using .jpg when there is no mapping (e.g. if the content-type header does not exist or specifies an unknown type).
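The mapping described above can be pulled into a small helper so it can be tried without making a network request. This is only a minimal sketch, not part of the original answer: the function name extension_from_content_type and its default parameter are hypothetical, and the exact extension returned for a given type depends on the MIME tables available to mimetypes on the local system.

```python
import mimetypes


def extension_from_content_type(content_type, default='.jpg'):
    """Map a Content-Type header value to a file extension.

    Hypothetical helper: falls back to `default` when the header is
    missing or the MIME type has no known extension.
    """
    if not content_type:
        return default
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    mime_type = content_type.split(';')[0].strip().lower()
    return mimetypes.guess_extension(mime_type) or default


if __name__ == '__main__':
    for header in ('image/png', 'image/gif; charset=binary', None):
        print(header, '->', extension_from_content_type(header))
```

In the download loop it would be called as extension_from_content_type(r.headers.get('content-type')).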

Upvotes: 6

schoolboychik

Reputation: 67

I bumped into the same problem and decided to inspect all the headers manually (with request.headers.get()). One of them actually contained a filename with the extension.

In my case it was 'Content-Disposition': 'attachment;filename=%D0%BF%D1%80%D0%BE%D0%B5%D0%BA%D1%82%20%D0%A4%D0%97.doc'
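A rough sketch of how the extension could be pulled out of such a header is shown here. It is an assumption-laden illustration, not the answerer's code: the helper name filename_from_content_disposition is hypothetical, and the parsing is deliberately naive (it only handles a plain filename= parameter, not the filename*= form, and would be confused by a quoted filename that itself contains a semicolon).

```python
import os
from urllib.parse import unquote


def filename_from_content_disposition(header):
    """Return the decoded filename from a Content-Disposition value, or None.

    Hypothetical, simplified parser: looks only for a `filename=` parameter.
    """
    if not header:
        return None
    # The first part is the disposition type (e.g. "attachment"); skip it.
    for part in header.split(';')[1:]:
        name, sep, value = part.strip().partition('=')
        if sep and name.strip().lower() == 'filename':
            # Remove optional surrounding quotes, then undo percent-encoding.
            return unquote(value.strip().strip('"'))
    return None


if __name__ == '__main__':
    header = ('attachment;filename='
              '%D0%BF%D1%80%D0%BE%D0%B5%D0%BA%D1%82%20%D0%A4%D0%97.doc')
    name = filename_from_content_disposition(header)
    print(name, os.path.splitext(name)[1])
```

With requests, the header value would come from r.headers.get('Content-Disposition'), which is None when the server does not send that header.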

Upvotes: 1

bastantoine

Reputation: 592

I would first split on the / to get only the last part, which should be the filename (including the extension) of the picture you downloaded, and then split on the . to separate the filename from the extension.

img_url = 'path/to/your/picture.jpg'
split1 = img_url.split('/')  # returns ['path', 'to', 'your', 'picture.jpg']
file = split1[-1]  # 'picture.jpg'
# rsplit with maxsplit=1 splits only on the last dot, so names like 'my.picture.jpg' also work
filename, extension = file.rsplit('.', 1)  # filename is 'picture' and extension is 'jpg'
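The same idea can be sketched with the standard library, which also copes with query strings and with URLs that have no extension at all. This is a hedged variant rather than the approach in the answer itself: the helper name extension_from_url is hypothetical, and it returns an empty string when the URL path has no extension.

```python
import os
from urllib.parse import urlparse


def extension_from_url(url):
    """Return the extension (including the dot) of the URL's path, or ''.

    Hypothetical helper: the query string and fragment are ignored because
    only the path component of the URL is inspected.
    """
    path = urlparse(url).path
    return os.path.splitext(path)[1]


if __name__ == '__main__':
    for url in ('path/to/your/picture.jpg',
                'https://example.com/img/picture.png?size=large',
                'https://example.com/images/12345'):
        print(url, '->', repr(extension_from_url(url)))
```

An empty result is the signal to fall back to another source, such as the response headers.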

Upvotes: 0
