Lucas Ribeiro

Reputation: 60

Links from BeautifulSoup without href or <a>

I am trying to create a bot that scrapes all the image links from a site and stores them somewhere else so I can download the images afterwards.

from selenium import webdriver
import time
from bs4 import BeautifulSoup as bs
import requests

url = 'https://www.artstation.com/artwork?sorting=trending'
page = requests.get(url)
driver = webdriver.Chrome()
driver.get(url)
time.sleep(3)  # give the JavaScript-rendered gallery time to load
soup = bs(driver.page_source, 'html.parser')
gallery = soup.find_all(class_="image-src")
for tag in gallery:
    print("TAG:")
    print(tag)

if page.status_code == 200:
    print("Request OK")

This returns all the tags containing the links I wanted, but I can't find a way to strip the HTML or copy only the links to a new list. Here is an example of the tag I get:

<div class="image-src" image-src="https://cdnb.artstation.com/p/assets/images/images/012/269/255/20180810092820/smaller_square/vince-rizzi-batman-n52-p1-a.jpg?1533911301" ng-if="::!project.hide_as_adult"></div>

So, how do I get only the links from the tags in the gallery list? Afterwards, I want to take these links and change the /smaller_square/ directory to /large/, which is the one that holds the high-resolution image.

Upvotes: 1

Views: 621

Answers (2)

Andrej Kesely

Reputation: 195543

The page loads its data through AJAX, so the browser's network inspector shows where the call is made. This snippet obtains all the image links found on page 1, sorted by trending:

import requests
import json

url = 'https://www.artstation.com/projects.json?page=1&sorting=trending'
page = requests.get(url)
json_data = json.loads(page.text)

for data in json_data['data']:
    print(data['cover']['medium_image_url'])

Prints:

https://cdna.artstation.com/p/assets/images/images/012/272/796/medium/ben-zhang-brigitte-hero-concept.jpg?1533921480
https://cdna.artstation.com/p/assets/covers/images/012/279/572/medium/ham-sung-choul-braveking-140823-1-3-s3-mini.jpg?1533959982
https://cdnb.artstation.com/p/assets/covers/images/012/275/963/medium/michael-vicente-orb-gem-thumb.jpg?1533933774
https://cdnb.artstation.com/p/assets/images/images/012/275/635/medium/michael-kutsche-piglet-by-michael-kutsche.jpg?1533932387
https://cdna.artstation.com/p/assets/images/images/012/273/384/medium/ben-zhang-unnamed.jpg?1533923353
https://cdnb.artstation.com/p/assets/covers/images/012/273/083/medium/michael-vicente-orb-guardian-thumb.jpg?1533922229

... and so on.

If you print the variable json_data, you will see the other information the page sends (such as the icon image URL, total_count, and data about the author).
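To copy the links into a list rather than print them, something like the following could work. It is only a sketch: the helper name `cover_urls` and the hand-built `sample` payload are hypothetical, and the sample merely mimics the `data` / `cover` / `medium_image_url` shape used in the snippet above (the entry with a missing cover is an assumption added to show the guard). In real use the payload would come from `page.json()`.

```python
def cover_urls(payload, size="medium_image_url"):
    """Return the cover URL of every project in a parsed projects.json payload."""
    urls = []
    for project in payload.get("data", []):
        # Guard against projects whose cover is missing or null (assumed possible).
        cover = project.get("cover") or {}
        url = cover.get(size)
        if url:
            urls.append(url)
    return urls


# Hypothetical payload, shaped like the response described above.
sample = {
    "data": [
        {"cover": {"medium_image_url": "https://cdna.artstation.com/p/assets/images/images/012/272/796/medium/ben-zhang-brigitte-hero-concept.jpg?1533921480"}},
        {"cover": None},
    ]
}

print(cover_urls(sample))
```

Keeping the extraction in a small function makes it easy to call once per page if you later loop over the page parameter in the URL.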

Upvotes: 4

Rakesh

Reputation: 82785

You can access a tag's attributes by key, the same way you would read a value from a dictionary.

Example:

from bs4 import BeautifulSoup
s = '''<div class="image-src" image-src="https://cdnb.artstation.com/p/assets/images/images/012/269/255/20180810092820/smaller_square/vince-rizzi-batman-n52-p1-a.jpg?1533911301" ng-if="::!project.hide_as_adult"></div>'''
soup = BeautifulSoup(s, "html.parser")
print(soup.find("div", class_="image-src")["image-src"])
# or
print(soup.find("div", class_="image-src").attrs['image-src'])

Output:

https://cdnb.artstation.com/p/assets/images/images/012/269/255/20180810092820/smaller_square/vince-rizzi-batman-n52-p1-a.jpg?1533911301
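Applied to the whole gallery, the same attribute lookup can fill a new list, and a plain string replace can then swap the directory as the question asks. This is a minimal sketch: the second `div` without an `image-src` attribute is a made-up case added to show the guard, and the idea that `/large/` holds the high-resolution file comes from the question and is not verified here.

```python
from bs4 import BeautifulSoup

# First tag is the example from the question; the second is hypothetical.
html = '''
<div class="image-src" image-src="https://cdnb.artstation.com/p/assets/images/images/012/269/255/20180810092820/smaller_square/vince-rizzi-batman-n52-p1-a.jpg?1533911301" ng-if="::!project.hide_as_adult"></div>
<div class="image-src" ng-if="::!project.hide_as_adult"></div>
'''

soup = BeautifulSoup(html, "html.parser")

# Keep only the attribute value of each matching tag, skipping tags without it.
links = [
    tag["image-src"]
    for tag in soup.find_all(class_="image-src")
    if tag.has_attr("image-src")
]

# Swap the thumbnail directory for the one the question says is high resolution.
large_links = [link.replace("/smaller_square/", "/large/") for link in links]

print(large_links)
```

With the Selenium code from the question, `soup` would instead be built from `driver.page_source`; the two list comprehensions stay the same.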

Upvotes: 1
