Ali Salehi

Reputation: 6999

How to download an image using Selenium (any version)?

I was wondering how one can use Selenium/WebDriver to download an image from a page. Assume that the user session is required to download the image, so having just the URL is not enough. Any sample code is highly appreciated.

Upvotes: 37

Views: 87702

Answers (14)

Khetho Mtembo

Reputation: 428

The following solution, written in Kotlin, retrieves an image that is already on the page from the browser cache, without making a second request. It uses the fetch API with the cache option set to force-cache, which asks the browser to use a cached copy of the asset whether it is fresh or stale.

import java.io.File
import java.util.Base64.getDecoder
import org.apache.commons.io.FileUtils

...

/* Get the webelement with the url to the image asset*/
val src = element.getAttribute("src")

/* You may need to dynamically determine your image type*/ 
/* Replace type with the appropriate value e.g. jpeg, gif etc.*/
val imageType = "image/type"

/*Create javascript to extract the image using fetch API*/
val js = driver as JavascriptExecutor
var script = "var done  = arguments[arguments.length - 1];" +
        "console.log('Downloading " + src + "');" +
        "fetch('" + src + "', {cache : 'force-cache'})" +
        ".then(r => r.blob({type: '" + imageType + "'}))" +
        ".then(blob => {" +
        "    var reader = new FileReader();" +
        "    reader.readAsDataURL(blob); " +
        "    reader.onloadend = function() {" +
        "    var base64data = reader.result;" +
        "    console.log(base64data);" +
        "    done(base64data);" +
        "    }})"

/*Inject and execute javascript asynchronously*/
val base64Url = js.executeAsyncScript(script) as String

/* Separate the prefix from the base64 data */
val base64Array = base64Url.split(",".toRegex())
.dropLastWhile { it.isEmpty() }
.toTypedArray()

/* Get base64 data*/
val base64Data = base64Array[base64Array.size - 1]

/* Convert base64 data into byte array */
val data: ByteArray = getDecoder().decode(base64Data)

/* Create file to write to with the appropriate name*/
val file = File("filename.type")

/* Write to file */
FileUtils.writeByteArrayToFile(file, data);

...

This assumes the asset is in the cache. Some providers send the Cache-Control response header with no-store, which tells the browser not to store the image in the cache; in that case the image cannot be served from the cache.

You may also be able to use the only-if-cached cache option, but only if the request mode is same-origin and the browser supports it. The force-cache option falls back to a normal network request if the image is not in the cache.
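
Whether the cache-based approach can apply to a given image can be checked from its Cache-Control response header. Below is a minimal sketch in Python (not Kotlin, and not part of the solution above); `parse_cache_control` and `allows_caching` are hypothetical helper names, and how you obtain the header value is left open.

```python
def parse_cache_control(header_value):
    """Split a Cache-Control header into a dict of lowercase directives."""
    directives = {}
    for part in header_value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        # Directives without a value (e.g. "no-store") map to None.
        directives[name.strip().lower()] = value.strip().strip('"') or None
    return directives


def allows_caching(header_value):
    """Return False when the response asked the browser not to store it."""
    return "no-store" not in parse_cache_control(header_value)
```

For example, `allows_caching("no-store")` is False, so such an image would have to be requested again.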

Upvotes: 0

Gravity API

Reputation: 859

How to download to a file, taking URL from element text or attribute

The complete extension code can be found here:

https://github.com/gravity-api/gravity-core/blob/master/src/csharp/Gravity.Core/Gravity.Core/Extensions/WebElementExtensions.cs

If you want to use this method without writing the code, use the NuGet package: https://www.nuget.org/packages/Gravity.Core/

Install-Package Gravity.Core -Version 2020.7.5.3

Usage

using OpenQA.Selenium.Extensions;
 
...
 
var driver = new ChromeDriver();
 
// from element attribute
var imageElement = driver.FindElement(By.XPath("//img[@id='my_img']")).DownloadResource(path: @"C:\images\cap_image_01.png", attribute: "src");
 
// from element text (a different variable name, so both statements compile in one scope)
var textElement = driver.FindElement(By.XPath("//div[1]")).DownloadResource(path: @"C:\images\cap_image_01.png");

Using the NuGet package is recommended, since it contains many more tools and extensions for Selenium.

To use it without the NuGet package (implementing it on your own):

Extension Class

using System.IO;
using System.Net.Http;
using System.Text.RegularExpressions;
using OpenQA.Selenium; // IWebElement
 
namespace Extensions
{
    public static class WebElementExtensions
    {
        public static IWebElement DownloadResource(this IWebElement element, string path)
        {
            return DoDownloadResource(element, path, "");
        }
 
        public static IWebElement DownloadResource(this IWebElement element, string path, string attribute)
        {
            return DoDownloadResource(element, path, attribute);
        }
 
        private static IWebElement DoDownloadResource(this IWebElement element, string path, string attribute)
        {
            // get resource address
            var resource = (string.IsNullOrEmpty(attribute))
                ? element.Text
                : element.GetAttribute(attribute);
 
            // download resource
            using (var client = new HttpClient())
            {
                // get response for the current resource
                var httpResponseMessage = client.GetAsync(resource).GetAwaiter().GetResult();
 
                // exit condition
                if (!httpResponseMessage.IsSuccessStatusCode) return element;
 
                // create directories path
                Directory.CreateDirectory(path);
 
                // get absolute file name
                var fileName = Regex.Match(resource, @"[^/\\&\?]+\.\w{3,4}(?=([\?&].*$|$))").Value;
                path = (path.LastIndexOf(@"\") == path.Length - 1)
                    ? path + fileName
                    : path + $@"\{fileName}";
 
                // write the file
                File.WriteAllBytes(path, httpResponseMessage.Content.ReadAsByteArrayAsync().GetAwaiter().GetResult());
            }
 
            // keep the fluent
            return element;
        }
    }
}

Usage

using Extensions;
 
...
 
var driver = new ChromeDriver();
 
// The extension class above treats `path` as a directory and appends the file name taken from the URL.

// from element attribute
var imageElement = driver.FindElement(By.XPath("//img[@id='my_img']")).DownloadResource(path: @"C:\images", attribute: "src");
 
// from element text
var textElement = driver.FindElement(By.XPath("//div[1]")).DownloadResource(path: @"C:\images");

Upvotes: 0

tot

Reputation: 164

Although @aboy021's JS code is syntactically correct, I couldn't get it running (using Chrome v83.xx).

However this code worked (Java):

    String url = "/your-url-goes.here.jpg";
    String imageData = (String) ((JavascriptExecutor) driver).executeAsyncScript(
            "var callback = arguments[0];" + // The callback from ExecuteAsyncScript
                    "var reader;" +
                    "var xhr = new XMLHttpRequest();" +
                    "xhr.onreadystatechange = function() {" +
                    "  if (xhr.readyState == 4) {" +
                        "var reader = new FileReader();" +
                        "reader.readAsDataURL(xhr.response);" +
                        "reader.onloadend = function() {" +
                        "    callback(reader.result);" +
                        "}" +
                    "  }" +
                    "};" +
                    "xhr.open('GET', '" + url + "', true);" +
                    "xhr.responseType = 'blob';" +
                    "xhr.send();");

    String base64Data = imageData.split(",")[1];

    byte[] decodedBytes = Base64.getDecoder().decode(base64Data);
    try (OutputStream stream = new FileOutputStream("c:\\dev\\tmp\\output.jpg")) {
        stream.write(decodedBytes);
    } catch (IOException e) {
        e.printStackTrace();
    }

Upvotes: 0

Benoît Galy

Reputation: 83

The only way I found to avoid downloading the image twice is to use the Chrome DevTools Protocol.

In Python, this gives:

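import base64
import pychrome

# Assumes `driver` is a Selenium Chrome driver and `tab` is a pychrome tab
# attached to the same browser (for example via its remote debugging port).


def save_image(file_content, file_name):
    try:
        file_content = base64.b64decode(file_content)
        with open("C:\\Crawler\\temp\\" + file_name, "wb") as f:
            f.write(file_content)
    except Exception as e:
        print(str(e))


# **kwargs absorbs any extra event fields the browser may send
def response_received(requestId, loaderId, timestamp, type, response, frameId, **kwargs):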
    if type == 'Image':
        url = response.get('url')
        print(f"Image loaded: {url}")
        response_body = tab.Network.getResponseBody(requestId=requestId)
        file_name = url.split('/')[-1].split('?')[0]
        if file_name:
            save_image(response_body['body'], file_name)


tab.Network.responseReceived = response_received

# start the tab 
tab.start()

# call method
tab.Network.enable()

# get request to target the site selenium 
driver.get("https://www.realtor.com/ads/forsale/TMAI112283AAAA")

# wait for loading
tab.wait(50)
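
The protocol's Network.getResponseBody result carries a base64Encoded flag next to the body, so the body is not guaranteed to be base64 in every case. A minimal sketch of handling both cases follows; `decode_cdp_body` is a hypothetical helper name, and the dictionary shape is assumed to be what pychrome passes through from the protocol.

```python
import base64


def decode_cdp_body(response_body):
    """Return the raw bytes of a Network.getResponseBody result."""
    body = response_body.get("body", "")
    if response_body.get("base64Encoded"):
        return base64.b64decode(body)
    # Text bodies (for example SVG images) arrive as plain strings.
    return body.encode("utf-8")
```

With a helper like this, the bytes it returns could be written to the file directly instead of being base64-decoded unconditionally.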

Upvotes: 3

Try the following:

JavascriptExecutor js = (JavascriptExecutor) driver;                              
String base64string = (String) js.executeScript("var c = document.createElement('canvas');"
                       + " var ctx = c.getContext('2d');"
                       + "var img = document.getElementsByTagName('img')[0];"
                       + "c.height=img.naturalHeight;"
                       + "c.width=img.naturalWidth;"
                       + "ctx.drawImage(img, 0, 0,img.naturalWidth, img.naturalHeight);"
                       + "var base64String = c.toDataURL();"
                       + "return base64String;");
String[] base64Array = base64string.split(",");

String base64 = base64Array[base64Array.length - 1];

byte[] data = Base64.getDecoder().decode(base64); // java.util.Base64

ByteArrayInputStream memstream = new ByteArrayInputStream(data);
BufferedImage saveImage = ImageIO.read(memstream);

ImageIO.write(saveImage, "png", new File("path"));

Upvotes: 6

aboy021

Reputation: 2305

For my use case there were cookies and other issues that made the other approaches here unsuitable.

I ended up using an XMLHttpRequest to populate a FileReader (from How to convert image into base64 string using javascript), and then calling that using Selenium's ExecuteAsyncScript (as shown in Selenium and asynchronous JavaScript calls). This allowed me to get a Data URL, which was straightforward to parse.

Here's my C# code for getting the Data URL:

public string ImageUrlToDataUrl(IWebDriver driver, string imageUrl)
{
  var js = new StringBuilder();
  js.AppendLine("var done = arguments[0];"); // The callback from ExecuteAsyncScript
  js.AppendLine(@"
    function toDataURL(url, callback) {
      var xhr = new XMLHttpRequest();
      xhr.onload = function() {
        var reader = new FileReader();
        reader.onloadend = function() {
          callback(reader.result);
        }
        reader.readAsDataURL(xhr.response);
      };
      xhr.open('GET', url);
      xhr.responseType = 'blob';
      xhr.send();
    }"); // XMLHttpRequest -> FileReader -> DataURL conversion
  js.AppendLine("toDataURL('" + imageUrl + "', done);"); // Invoke the function

  var executor = (IJavaScriptExecutor) driver;
  var dataUrl = executor.ExecuteAsyncScript(js.ToString()) as string;
  return dataUrl;
}
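
Parsing the returned Data URL might look like the sketch below. It is written in Python rather than C# purely as an illustration; `parse_data_url` is a hypothetical helper name, and it only handles the base64 form that FileReader.readAsDataURL produces.

```python
import base64


def parse_data_url(data_url):
    """Split a data URL into (mime_type, raw_bytes)."""
    if not data_url.startswith("data:"):
        raise ValueError("not a data URL")
    header, sep, payload = data_url[len("data:"):].partition(",")
    if not sep:
        raise ValueError("data URL has no payload")
    parts = header.split(";")
    mime_type = parts[0] or "text/plain"
    if "base64" not in parts[1:]:
        raise ValueError("only base64 data URLs are handled in this sketch")
    return mime_type, base64.b64decode(payload)
```

The bytes can then be written to a file opened in binary mode, with the extension chosen from the MIME type.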

Upvotes: 5

coding_idiot

Reputation: 13734

I prefer doing something like this:

1. Get the SRC attribute of the image.
2. Use ImageIO.read to read the image onto a BufferedImage
3. Save the BufferedImage using ImageIO.write function

For example:

String src = imgElement.getAttribute("src");
BufferedImage bufferedImage = ImageIO.read(new URL(src));
File outputfile = new File("saved.png");
ImageIO.write(bufferedImage, "png", outputfile);

Upvotes: 23

Super Mario

Reputation: 939

Works for me:

# open the image in a new tab
driver.execute_script('''window.open("''' + wanted_url + '''","_blank");''')
sleep(2)
driver.switch_to.window(driver.window_handles[1])
sleep(2)

# make screenshot
driver.save_screenshot("C://Folder/" + photo_name + ".jpeg")
sleep(2)

# close the new tab
driver.execute_script('''window.close();''')
sleep(2)

#back to original tab
driver.switch_to.window(driver.window_handles[0])

Upvotes: -1

nacmonad

Reputation: 39

Here is a JavaScript solution. It's a tad silly, and I'm wary of hitting the source image's server with too many requests. Can someone tell me if fetch() accesses the browser's cache? I don't want to spam the source server.

It attaches a FileReader to the window, fetches the image, converts it to base64, and stores that string on the window.

The driver can then return that window variable.

export async function scrapePic(driver) {
  // declared outside try so it is still in scope for the final return
  let img64;
  try {
    console.log("waiting for that profile piccah")
    console.log(driver)

    let rootEl = await driver.findElement(By.css('.your-root-element'));
    let imgEl = await rootEl.findElement(By.css('img'))
    // the timeout is an argument of driver.wait, not of the condition
    await driver.wait(until.elementIsVisible(imgEl), 10000);
    console.log('profile piccah found')
    let img = await imgEl.getAttribute('src')
    // attach reader to driver window
    await driver.executeScript(`window.myFileReader = new FileReader();`)
    // use the same reader name that was attached above
    await driver.executeScript(`
      window.myFileReader.onloadend = function() {
        window['profileImage'] = this.result
      }
      fetch( arguments[0] ).then( res => res.blob() ).then( blob => window.myFileReader.readAsDataURL(blob) )
      `, img)
    await driver.sleep(5000)
    img64 = await driver.executeScript(`return window.profileImage`)
    console.log(img64)
  } catch (e) {
    console.log(e)
  }
  return img64
}

Upvotes: 0

speedplane

Reputation: 16121

Other solutions here don't work across all browsers, don't work across all websites, or both.

This solution should be far more robust. It uses the browser to view the image, resizes the browser to fit the image size, takes a screenshot, and finally resizes the browser back to the original size.

Python:

import logging

def get_image(driver, img_url):
    '''Given an image's URL, return a binary screenshot of it in PNG format.'''
    driver.get(img_url)

    # Get the dimensions of the browser and image.
    orig_h = driver.execute_script("return window.outerHeight")
    orig_w = driver.execute_script("return window.outerWidth")
    margin_h = orig_h - driver.execute_script("return window.innerHeight")
    margin_w = orig_w - driver.execute_script("return window.innerWidth")
    new_h = driver.execute_script('return document.getElementsByTagName("img")[0].height')
    new_w = driver.execute_script('return document.getElementsByTagName("img")[0].width')

    # Resize the browser window.
    logging.info("Getting Image: orig %sX%s, marg %sX%s, img %sX%s - %s"%(
      orig_w, orig_h, margin_w, margin_h, new_w, new_h, img_url))
    driver.set_window_size(new_w + margin_w, new_h + margin_h)

    # Get the image by taking a screenshot of the page.
    img_val = driver.get_screenshot_as_png()
    # Set the window size back to what it was.
    driver.set_window_size(orig_w, orig_h)

    # Go back to where we started.
    driver.back()
    return img_val

One disadvantage of this solution is that if the image is very small, the browser will not resize that small, and you may get a black border around it.

Upvotes: 2

Bassem Shahin

Reputation: 706

Use Selenium to get the image src:

elemImg.get_attribute('src')

Then download the image with your programming language. For Python, check this answer: How to save an image locally using Python whose URL address I already know?
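
A minimal sketch of that second step with only the Python standard library could look as follows. `filename_from_url` and `save_image` are hypothetical helper names. Note that this sends a fresh request outside the browser session, so it only works for images that do not require a login.

```python
import os
import urllib.request
from urllib.parse import urlparse, unquote


def filename_from_url(url, default="image"):
    """Take the last path segment of the URL, ignoring query and fragment."""
    name = os.path.basename(unquote(urlparse(url).path))
    return name or default


def save_image(url, dest_dir="."):
    """Download url and write it into dest_dir; returns the saved path."""
    path = os.path.join(dest_dir, filename_from_url(url))
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        out.write(response.read())
    return path
```

It would be called as `save_image(elemImg.get_attribute('src'))`.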

Upvotes: 1

samson

Reputation: 1607

I prefer to do it like this:

 WebElement logo = driver.findElement(By.cssSelector(".image-logo"));
 String logoSRC = logo.getAttribute("src");

 URL imageURL = new URL(logoSRC);
 BufferedImage saveImage = ImageIO.read(imageURL);

 ImageIO.write(saveImage, "png", new File("logo-image.png"));

Upvotes: 5

Gadget

Reputation: 484

Another mostly correct solution is to download the image directly with a simple HTTP request.
You can reuse WebDriver's user session, because it stores the cookies.
In my example, I only check the status code that is returned. If it is 200, the image exists and is available to show or download. If you really need to download the file itself, you can read the image data from the httpResponse entity (use it as a simple input stream).

// just look at your cookie's content (e.g. using browser)
// and import these settings from it
private static final String SESSION_COOKIE_NAME = "JSESSIONID";
private static final String DOMAIN = "domain.here.com";
private static final String COOKIE_PATH = "/cookie/path/here";

protected boolean isResourceAvailableByUrl(String resourceUrl) {
    HttpClient httpClient = new DefaultHttpClient();
    HttpContext localContext = new BasicHttpContext();
    BasicCookieStore cookieStore = new BasicCookieStore();
    // apply jsessionid cookie if it exists
    cookieStore.addCookie(getSessionCookie());
    localContext.setAttribute(ClientContext.COOKIE_STORE, cookieStore);
    // resourceUrl - is url which leads to image
    HttpGet httpGet = new HttpGet(resourceUrl);

    try {
        HttpResponse httpResponse = httpClient.execute(httpGet, localContext);
        return httpResponse.getStatusLine().getStatusCode() == HttpStatus.SC_OK;
    } catch (IOException e) {
        return false;
    }
}

protected BasicClientCookie getSessionCookie() {
    Cookie originalCookie = webDriver.manage().getCookieNamed(SESSION_COOKIE_NAME);

    if (originalCookie == null) {
        return null;
    }

    // just build new apache-like cookie based on webDriver's one
    String cookieName = originalCookie.getName();
    String cookieValue = originalCookie.getValue();
    BasicClientCookie resultCookie = new BasicClientCookie(cookieName, cookieValue);
    resultCookie.setDomain(DOMAIN);
    resultCookie.setExpiryDate(originalCookie.getExpiry());
    resultCookie.setPath(COOKIE_PATH);
    return resultCookie;
}
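
The same idea, reusing the browser's cookies for a direct HTTP request, can be sketched in Python with the standard library. This is an illustration under assumptions, not a port of the Java code: `cookie_header` and `fetch_with_session` are hypothetical names, the input is assumed to be the list of dicts returned by Selenium's `driver.get_cookies()`, and all cookies are sent without any domain or path filtering.

```python
import urllib.request


def cookie_header(selenium_cookies):
    """Build a Cookie header value from driver.get_cookies() output."""
    return "; ".join(f"{c['name']}={c['value']}" for c in selenium_cookies)


def fetch_with_session(url, selenium_cookies):
    """GET url with the browser's cookies; returns (status, body bytes)."""
    request = urllib.request.Request(
        url, headers={"Cookie": cookie_header(selenium_cookies)}
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request) as response:
        return response.status, response.read()
```

A call such as `fetch_with_session(img_url, driver.get_cookies())` would return the image bytes when the session cookie is all the server checks.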

Upvotes: 3

Gadget

Reputation: 484

If you need to test that an image exists and is available, you can do it like this:

protected boolean isResourceAvailableByUrl(String resourceUrl) {
    // backup current url, to come back to it in future
    String currentUrl = webDriver.getCurrentUrl();
    try {
        // try to get image by url
        webDriver.get(resourceUrl);
        // if the "resource not found" message did not appear, the image exists
        return webDriver.findElements(RESOURCE_NOT_FOUND).isEmpty();
    } finally {
        // back to page
        webDriver.get(currentUrl);
    }
}

But you need to be sure that navigating to currentUrl really returns you to the page you were on before this method ran. In my case it did. If not, you can try:

webDriver.navigate().back()

Unfortunately, there seems to be no way to inspect the response status code. That's why you need to find a specific web element on the NOT_FOUND page, check whether it appeared, and decide from that whether the image exists.

It is just a workaround, because I found no official way to solve it.

NOTE: This solution is helpful when you use an authorized session to get the resource and can't just download it with ImageIO or directly with HttpClient.

Upvotes: 0
