Pearl

Reputation: 133

How to automatically download multiple images with broken links in R?

The goal is to download a batch of images, but some of the image URLs are broken. I would like to skip any URL that returns something other than status code 200 (for example a 404) and move on to the next one, much like a simple next statement would. I am not sure how to do this in vectorized code, and when I try a for loop I cannot figure out how to initialize an object of the image type to write into. So now I am reading the source of image_read, trying to work out where the error is raised and where a next statement (or something akin to it) could go, if that is even possible in vectorized code:

Simple Vectorized Code:

library(magick)
library(rsvg)

image_urls <- na.omit(articles$url_to_image)
image_content <- image_read(image_urls)

Opaque function code (where does the error get raised? It looks like just a series of calls for reading different types of images):

function (path, density = NULL, depth = NULL, strip = FALSE, 
    coalesce = TRUE, defines = NULL) 
{
    if (is.numeric(density)) 
        density <- paste0(density, "x", density)
    density <- as.character(density)
    depth <- as.integer(depth)
    
    #doesn't seem relevant: https://rdrr.io/cran/magick/src/R/defines.R
    defines <- validate_defines(defines)
    
    #test whether the object is an instance of an S4 class and a function to test inheritance relationships between object and class -- seems relevant maybe?
    image <- if (isS4(path) && methods::is(path, "Image"))
      {
        #bioconductor class
        convert_EBImage(path)
    }
    else if (inherits(path, "nativeRaster") || (is.matrix(path) && 
        is.integer(path))) {
        image_read_nativeraster(path)
    }
    else if (inherits(path, "cimg")) {
        image_read_cimg((path))
    }
    else if (grDevices::is.raster(path)) {
        image_read_raster2(path)
    }
    else if (is.matrix(path) && is.character(path)) {
        image_read_raster2(grDevices::as.raster(path))
    }
    else if (is.array(path)) {
        image_readbitmap(path)
    }
    else if (is.raw(path)) {
        magick_image_readbin(path, density, depth, strip, defines)
    }
    else if (is.character(path) && all(nchar(path))) {
        path <- vapply(path, replace_url, character(1))
        path <- if (is_windows()) {
            enc2utf8(path)
        }
        else {
            enc2native(path)
        }
        magick_image_readpath(path, density, depth, strip, defines)
    }
    else {
        stop("path must be URL, filename or raw vector")
    }
    if (is.character(path) && !isTRUE(magick_config()$rsvg)) {
        if (any(grepl("\\.svg$", tolower(path))) || any(grepl("svg|mvg", 
            tolower(image_info(image)$format)))) {
            warning("ImageMagick was built without librsvg which causes poor qualty of SVG rendering.\nFor better results use image_read_svg() which uses the rsvg package.", 
                call. = FALSE)
        }
    }
    if (isTRUE(coalesce) && length(image) > 1 && identical("GIF", 
        toupper(image_info(image)$format[1]))) {
        return(image_coalesce(image))
    }
    return(image)
}

When a link is broken, image_read fails with: Error in download_url(path) : Failed to download "link" (HTTP 404)

Possible For Loop Code?

library(magick)
library(rsvg)

image_urls <- na.omit(articles$url_to_image)

image_content <- c() #doesn't work, nor does NULL 
#nor does setting to typeof image_content <- image_url[1]

for(i in 1:length(image_urls)){
  image_content[i] = image_read(image_urls[i])
    if(grepl('404', download_path(url), fixed = TRUE) == T)
    next
}

But again, I cannot initialize image_content, and I don't know whether the loop will stop with an error before it even reaches the if statement.

Maybe there is another library I should use... or just another language?

Here is some sample data:

data <- c("https://img-s-msn-com.akamaized.net/tenant/amp/entityid/AAOgEbG.img?h=488&w=799&m=6&q=60&o=f&l=f", 
"https://img-s-msn-com.akamaized.net/tenant/amp/entityid/AAOh6FW.img?h=533&w=799&m=6&q=60&o=f&l=f", 
"https://img-s-msn-com.net/tenant/amp/entityid/AAOgIFh.img?h=450&w=799&m=6&q=60&o=f&l=f&x=570&y")

Upvotes: 2

Views: 373

Answers (2)

Martin Gal

Reputation: 16978

You could try the try function:

library(magick)

image_urls <- data

image_content <- lapply(seq_along(image_urls), function(i) try(image_read(image_urls[i])))

This stores your images in a list. Using

image_content[[1]]

gives you access to the first image. If there are errors like

Error in curl::curl_fetch_memory(url) : 
Could not resolve host: img-s-msn-com.net simpleError in curl::curl_fetch_memory(url)

those are caught by try: the failing element is stored as an object of class "try-error" and lapply proceeds to the next URL.
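If you only want the images that actually loaded, you can drop the failed elements after the fact. Below is a minimal sketch of that idea. The helper read_available and the stand-in fake_reader are hypothetical names used only so the example runs offline; for real use you would pass magick::image_read as the reader.

```r
# Hypothetical helper: apply `reader` to every URL and separate successes
# from failures. `reader` would be magick::image_read in real use.
read_available <- function(urls, reader) {
  # try() turns an error into an object of class "try-error"
  results <- lapply(urls, function(u) try(reader(u), silent = TRUE))
  failed <- vapply(results, inherits, logical(1), what = "try-error")
  list(images = results[!failed], failed_urls = urls[failed])
}

# Stand-in reader (assumption): fails for any URL containing "broken"
fake_reader <- function(u) {
  if (grepl("broken", u, fixed = TRUE)) stop("HTTP 404")
  toupper(u)
}

out <- read_available(c("a", "broken-b", "c"), fake_reader)
length(out$images)   # number of URLs that loaded
out$failed_urls      # URLs that raised an error
```

Keeping failed_urls alongside the images makes it easy to see which links were broken instead of silently losing them.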

Upvotes: 4

nniloc

Reputation: 4243

Another option is to use purrr::safely to create a "safe" version of image_read which will return both result and error for each url.

Results can be extracted from the list using something like purrr::map(y,`[[`, 'result').

# two working links and one broken
urls <- c("https://img-s-msn-com.akamaized.net/tenant/amp/entityid/AAOgEbG.img?h=488&w=799&m=6&q=60&o=f&l=f", 
          "https://img-s-msn-com.akamaized.net/tenant/amp/entityid/AAOh6FW.img?h=533&w=799&m=6&q=60&o=f&l=f", 
          "https://img-s-msn-com.net/tenant/amp/entityid/AAOgIFh.img?h=450&w=799&m=6&q=60&o=f&l=f&x=570&y")

# create 'safe' function
image_read_safe <- purrr::safely(magick::image_read)

# apply 'safe' function
y <- purrr::map(urls, image_read_safe)

y
#> [[1]]
#> [[1]]$result
#>   format width height colorspace matte filesize density
#> 1   JPEG   799    488       sRGB FALSE    39743   96x96
#> 
#> [[1]]$error
#> NULL
#> 
#> 
#> [[2]]
#> [[2]]$result
#>   format width height colorspace matte filesize density
#> 1   JPEG   799    533       sRGB FALSE    53910   96x96
#> 
#> [[2]]$error
#> NULL
#> 
#> 
#> [[3]]
#> [[3]]$result
#> NULL
#> 
#> [[3]]$error
#> <simpleError in curl::curl_fetch_memory(url): Could not resolve host: img-s-msn-com.net>

Created on 2021-09-10 by the reprex package (v2.0.0)
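To go from that list of result/error pairs to just the usable images, one option is to pull out the result elements and drop the NULL ones. The sketch below assumes purrr is installed and uses a hypothetical stand-in, fake_read, in place of magick::image_read so that it runs without network access; the URL strings are placeholders too.

```r
library(purrr)

# Stand-in for magick::image_read (assumption): fails on URLs containing "bad"
fake_read <- function(url) {
  if (grepl("bad", url, fixed = TRUE)) stop("could not resolve host")
  paste("image from", url)
}

urls <- c("good-1", "bad-2", "good-3")
y <- map(urls, safely(fake_read))

# Extract the 'result' element of each entry; it is NULL where the read failed
results <- map(y, "result")
ok <- !map_lgl(results, is.null)

images <- results[ok]      # successful reads only
failed_urls <- urls[!ok]   # links that raised an error
```

With real magick images, the remaining list elements can then be processed one by one or combined as needed.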

Upvotes: 2
