Reputation: 113
I have been able to filter all the image URLs from a page and display them one after the other:
import requests
import shutil
import numpy as np
import cv2
from bs4 import BeautifulSoup

article_URL = "https://medium.com/bhavaniravi/build-your-1st-python-web-app-with-flask-b039d11f101c"

response = requests.get(article_URL)
soup = BeautifulSoup(response.text, 'html.parser')
images = soup.find('body').find_all('img')

i = 0
image_url = []

for im in images:
    print(im)
    i += 1
    url = im.get('src')
    image_url.append(url)
    print('Downloading: ', url)
    try:
        response = requests.get(url, stream=True)
        with open(str(i) + '.jpg', 'wb') as out_file:
            shutil.copyfileobj(response.raw, out_file)
        del response
    except:
        print('Could not download: ', url)

new = [x for x in image_url if x is not None]

for url in new:
    resp = requests.get(url, stream=True).raw
    image = np.asarray(bytearray(resp.read()), dtype="uint8")
    image = cv2.imdecode(image, cv2.IMREAD_COLOR)

    # height, width, channels = image.shape
    height, width, _ = image.shape

    dimension = []
    for items in height, width:
        dimension.append(items)

    # print(height, width)
    print(dimension)
I want to print the image with the largest dimensions from the list of URLs. This is the output I get, which is not good enough:
[72, 72]
[95, 96]
[13, 60]
[227, 973]
[17, 60]
[229, 771]
Upvotes: 0
Views: 613
Reputation: 142681
I see two problems.

First, you create dimension = [] inside the loop, so every iteration discards the previous values. You have to create dimension = [] before the loop, inside the loop use

dimension.append((width, height))

and after the loop you can use max(dimension) to get the pair with the largest width.

Second, you keep only width and height in dimension, so you don't know which file has those dimensions. You should keep all the information, for example

dimension.append((width, height, url, filename))
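Applied to your second loop, a minimal sketch of both fixes might look like this (it assumes new, the list of non-None URLs from the question, plus the requests/numpy/cv2 imports are already available; it keeps only the URL because this loop does not save files):

dimension = []  # created once, before the loop

for url in new:
    resp = requests.get(url, stream=True).raw
    image = cv2.imdecode(np.asarray(bytearray(resp.read()), dtype="uint8"),
                         cv2.IMREAD_COLOR)
    if image is None:  # skip responses that are not decodable images
        continue
    height, width = image.shape[:2]
    dimension.append((width, height, url))  # keep the URL so you know which image it is

# tuples compare element by element, so max() returns the entry with the largest width
print(max(dimension))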
My version: I use a list of dictionaries, data, to keep all of the information

data.append({
    'url': url,
    'path': filename,
    'width': width,
    'height': height,
})

and later I use the key argument in max() to get the item with the largest width

max(data, key=lambda x: x['width'])

but in the same way I could use x['height'] or x['width'] * x['height'].
import requests
from bs4 import BeautifulSoup
import shutil
import cv2

article_URL = "https://medium.com/bhavaniravi/build-your-1st-python-web-app-with-flask-b039d11f101c"

response = requests.get(article_URL)
soup = BeautifulSoup(response.text, 'html.parser')
images = soup.find('body').find_all('img')

# --- loop ---

data = []

i = 0
for img in images:
    print('HTML:', img)

    url = img.get('src')

    if url:  # skip `url` with `None`
        print('Downloading:', url)
        try:
            response = requests.get(url, stream=True)

            i += 1
            url = url.rsplit('?', 1)[0]   # remove ?opt=20 after filename
            ext = url.rsplit('.', 1)[-1]  # .png, .jpg, .jpeg
            filename = f'{i}.{ext}'
            print('Filename:', filename)

            with open(filename, 'wb') as out_file:
                shutil.copyfileobj(response.raw, out_file)

            image = cv2.imread(filename)
            height, width = image.shape[:2]

            data.append({
                'url': url,
                'path': filename,
                'width': width,
                'height': height,
            })
        except Exception as ex:
            print('Could not download: ', url)
            print('Exception:', ex)

    print('---')

# --- after loop ---

print('max:', max(data, key=lambda x: x['width']))

all_sorted = sorted(data, key=lambda x: x['width'], reverse=True)
print('Top 3:', all_sorted[:3])
# or
for item in all_sorted[:3]:
    print(item['width'], item['url'])
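The same key trick picks the largest image by area instead of by width; a small sketch reusing the data list built above:

# entry with the largest area (width * height)
biggest = max(data, key=lambda x: x['width'] * x['height'])
print('max area:', biggest['width'] * biggest['height'], biggest['url'])

# the downloaded file can then be reloaded, e.g. with cv2
largest_image = cv2.imread(biggest['path'])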
BTW: to get only the <img> tags that actually have a src attribute, you can use

.find_all('img', {'src': True})
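For example (a small sketch assuming the soup object from the code above):

# only <img> tags that actually have a src attribute
images = soup.find('body').find_all('img', {'src': True})

# equivalent keyword form
images = soup.find('body').find_all('img', src=True)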
Upvotes: 1
Reputation: 1619
Make these changes in your code, just after you create the new list:
images = []

for url in new:
    resp = requests.get(url, stream=True).raw
    image = np.asarray(bytearray(resp.read()), dtype="uint8")
    image = cv2.imdecode(image, cv2.IMREAD_COLOR)
    if image is None:  # skip URLs that could not be decoded as images
        continue
    images.append((image.shape, image))

# sort images by area (largest to smallest)
images.sort(key=lambda x: x[0][0] * x[0][1], reverse=True)
The largest image is now at index 0 and can be accessed with images[0][1]; its shape can be printed with images[0][0]. You can change the lambda function to x[0][0] (to sort by height) or x[0][1] (to sort by width) as well.
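For example, to print the shape of the largest image and save it to a file (a minimal sketch assuming the images list built above is non-empty; the filename is just an example):

largest_shape, largest_image = images[0]
print('largest shape (height, width, channels):', largest_shape)

# write the largest image to disk
cv2.imwrite('largest.jpg', largest_image)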
Upvotes: 1