nath

Reputation: 113

How to return the image with the largest dimensions

I was able to extract all the image URLs from a page and display them one after another:

import shutil

import cv2
import numpy as np
import requests
from bs4 import BeautifulSoup


article_URL = "https://medium.com/bhavaniravi/build-your-1st-python-web-app-with-flask-b039d11f101c"
response = requests.get(article_URL)
soup = BeautifulSoup(response.text, 'html.parser')
images = soup.find('body').find_all('img')
i = 0
image_url = []
for im in images:
    print(im)
    i+=1
    url = im.get('src')
    image_url.append(url)
    print('Downloading: ', url) 
    try:
        response = requests.get(url, stream=True)
        with open(str(i) + '.jpg', 'wb') as out_file:
            shutil.copyfileobj(response.raw, out_file)
            del response
    except:
        print('Could not download: ', url)

new = [x for x in image_url if x is not None]
for url in new:
    resp = requests.get(url, stream=True).raw
    image = np.asarray(bytearray(resp.read()), dtype="uint8")
    image = cv2.imdecode(image, cv2.IMREAD_COLOR)
#     height, width, channels = image.shape
    height, width, _ = image.shape
    dimension = []
    for items in height, width:
        dimension.append(items)
#     print(height, width)
    print(dimension)

I want to print the image with the largest dimensions from the list of URLs.

This is the output I currently get. It only lists the [height, width] of each image, which is not good enough:

[72, 72]
[95, 96]
[13, 60]
[227, 973]
[17, 60]
[229, 771]

Upvotes: 0

Views: 613

Answers (2)

furas

Reputation: 142681

I see two problems.

  1. You create dimension = [] inside the loop, so every iteration discards the previous values. Create dimension = [] before the loop, and inside the loop use

    dimension.append( (width, height) )
    

    After the loop you can use max(dimension) to get the pair with the largest width (tuples are compared element by element, so width is compared first).

  2. You keep only width and height in dimension, so you don't know which file has these dimensions. You should keep all the information:

    dimension.append( (width, height, url, filename) ) 
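
The first point can be sketched without downloading anything. This is only an illustration: sizes holds the [height, width] pairs copied from the output in the question, standing in for real images. It shows the list being created once before the loop, and max() comparing the (width, height) tuples by width first.

```python
# [height, width] pairs copied from the output shown in the question
# (stand-in data; no images are downloaded in this sketch)
sizes = [[72, 72], [95, 96], [13, 60], [227, 973], [17, 60], [229, 771]]

# Create the list once, before the loop, so the values accumulate
dimension = []
for height, width in sizes:
    dimension.append((width, height))

# Tuples compare element by element, so the pair with the largest width wins
widest = max(dimension)
print(widest)
```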
    

My version.

I use a list of dictionaries, data, to keep all the information:

data.append({
                'url': url,
                'path': filename,
                'width': width,
                'height': height,
            })

and later I use the key argument of max() to get the item with the largest width:

max(data, key=lambda x:x['width'])

In the same way I could use x['height'] or x['width'] * x['height'] as the key.

import requests
from bs4 import BeautifulSoup
import shutil
import cv2

article_URL = "https://medium.com/bhavaniravi/build-your-1st-python-web-app-with-flask-b039d11f101c"

response = requests.get(article_URL)
soup = BeautifulSoup(response.text, 'html.parser')
images = soup.find('body').find_all('img')

# --- loop --- 

data = []
i = 0

for img in images:
    print('HTML:', img)
    
    url = img.get('src')

    if url:  # skip `url` with `None`
        print('Downloading:', url) 
        try:
            response = requests.get(url, stream=True)

            i += 1
            url = url.rsplit('?', 1)[0]  # remove ?opt=20 after filename
            ext = url.rsplit('.', 1)[-1] # .png, .jpg, .jpeg
            filename = f'{i}.{ext}' 
            print('Filename:', filename)

            with open(filename, 'wb') as out_file:
                shutil.copyfileobj(response.raw, out_file)

            image = cv2.imread(filename)
            height, width = image.shape[:2]

            data.append({
                'url': url,
                'path': filename,
                'width': width,
                'height': height,
            })

        except Exception as ex:
            print('Could not download: ', url)
            print('Exception:', ex)

    print('---')

# --- after loop ---

print('max:', max(data, key=lambda x:x['width']))

all_sorted = sorted(data, key=lambda x:x['width'], reverse=True)

print('Top 3:', all_sorted[:3])
# or
for item in all_sorted[:3]:
    print(item['width'], item['url'])
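
The key choices mentioned earlier (width, height, or area) can be compared on a small hand-made list. This is a sketch under assumptions: the entries in sample are hypothetical, with sizes taken from the question's output and placeholder file names, and largest() is an illustrative helper that is not part of the code above. Passing default=None to max() is one way to avoid a ValueError when no image could be downloaded and the list is empty.

```python
# Hypothetical entries: sizes come from the question's output,
# the file names are placeholders
sample = [
    {'path': '1.png', 'width': 96, 'height': 95},
    {'path': '2.png', 'width': 973, 'height': 227},
    {'path': '3.png', 'width': 771, 'height': 229},
]


def largest(items, key):
    """Return the item with the largest key value, or None for an empty list."""
    return max(items, key=key, default=None)


by_width = largest(sample, lambda x: x['width'])
by_height = largest(sample, lambda x: x['height'])
by_area = largest(sample, lambda x: x['width'] * x['height'])

print('widest: ', by_width['path'])
print('tallest:', by_height['path'])
print('biggest:', by_area['path'])
print('empty:  ', largest([], lambda x: x['width']))
```

Note that the widest and the tallest image are different files here, so the choice of key matters.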

BTW: to get only the img tags that have a src attribute, use

 .find_all('img', {'src': True})

Upvotes: 1

Knight Forked

Reputation: 1619

Make these changes in your code, just after you create the new list:

images = []
for url in new:
    resp = requests.get(url, stream=True).raw
    image = np.asarray(bytearray(resp.read()), dtype="uint8")
    image = cv2.imdecode(image, cv2.IMREAD_COLOR)
    images.append((image.shape, image))
# sort images by area (largest to smallest)
images.sort(key=lambda x: x[0][0] * x[0][1], reverse=True)

The largest image is now at index 0 and can be accessed with images[0][1]; its shape can be printed using images[0][0]. You can also change the lambda function to x[0][0] (sort by height) or x[0][1] (sort by width).
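
To see how the different lambda functions change the ordering without downloading anything, here is a small sketch that uses plain shape tuples in place of decoded images. The shapes are the height and width values from the question's output with an assumed channel count of 3, and the string labels are hypothetical stand-ins for the image arrays.

```python
# (height, width, channels) tuples; height and width come from the
# question's output, the channel count of 3 is an assumption
shapes = [(72, 72, 3), (95, 96, 3), (13, 60, 3),
          (227, 973, 3), (17, 60, 3), (229, 771, 3)]

# Pair each shape with a label standing in for the decoded image array
images = [(shape, f'image {n}') for n, shape in enumerate(shapes, start=1)]

# Same (shape, image) layout as above, sorted three different ways
by_area = sorted(images, key=lambda x: x[0][0] * x[0][1], reverse=True)
by_height = sorted(images, key=lambda x: x[0][0], reverse=True)
by_width = sorted(images, key=lambda x: x[0][1], reverse=True)

print('largest area:  ', by_area[0])
print('largest height:', by_height[0])
print('largest width: ', by_width[0])
```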

Upvotes: 1
