Newbie
Newbie

Reputation: 281

How to scrape images with Cheerio and paste to Google Sheets?

This is my first trial to learn to how to scrape images from a web and paste them to Google Sheets. I want to download the second image from https://ir.eia.gov/ngs/ngs.html and paste it to a Google Sheet. In the web, there are two images. I want to get the second image under <img alt="Working Gas in Underground Storage Compared with Five-Year Range" src="ngs.gif" border="0">. I like to learn how to reference its img alt= or src="ngs.gif" in the code, not the index so I can utilize the concept to other various HTML situations also. Can anyone help fix the following code so that I can learn? Thank you!

function test() {
  const url = 'https://ir.eia.gov/ngs/ngs.html';
  const res = UrlFetchApp.fetch(url, { muteHttpExceptions: true }).getContentText();
  var $ = Cheerio.load(res);
  
  // I want to download the image, <img alt="Working Gas in Underground Storage Compared with Five-Year Range" src="ngs.gif" border="0">
  // What should be changed in the following code?
  var chart = $('img').attr('src').find('ngs.gif');
  SpreadsheetApp.getActiveSheet().insertImage(chart, 1, 1);
}

enter image description here

Upvotes: 0

Views: 671

Answers (1)

Tanaike
Tanaike

Reputation: 201553

I believe your goal as follows.

  • You want to retrieve the 2nd image of img tags and put it to the Spreadsheet.

In this HTML, it seems that the URL is https://ir.eia.gov/ngs/ + filename. So I thought that the method of insertImage(url, column, row) can be used. When this is reflect to your script, how about the following modified script?

Modified script:

function test() {
  const url = 'https://ir.eia.gov/ngs/ngs.html';
  const res = UrlFetchApp.fetch(url, { muteHttpExceptions: true }).getContentText();
  const $ = Cheerio.load(res);
  const urls = [];
  $('img').each(function () {
    urls.push("https://ir.eia.gov/ngs/" + $(this).attr('src'));
  });
  if (urls.length > 1) {
    SpreadsheetApp.getActiveSheet().insertImage(urls[1], 1, 1); // 2nd image is retrieved.
  }
}
  • When this script is run, the URL of https://ir.eia.gov/ngs/ngs.gif is retrieved and the image is put to the Spreadsheet.

Reference:

Added:

About your following new question in the comment,

Thanks a lot! So other than calling the index of the image, is there no method to call either alt="Working Gas in Underground Storage Compared with Five-Year Range" or src="ngs.gif" in the code? I'm just curious to learn a smart way for a potential scenario, for instance, if a web has 20 images and the locations of those images keep changing day by day, so the second image is not always in the second place. Thank you again for any guide!

In this case, how about the following sample script?

Sample script:

function test() {
  const url = 'https://ir.eia.gov/ngs/ngs.html';
  const res = UrlFetchApp.fetch(url, { muteHttpExceptions: true }).getContentText();
  const $ = Cheerio.load(res);

  const obj = [];
  $('img').each(function () {
    const t = $(this);
    const src = t.attr('src');
    obj.push({ alt: t.attr('alt'), src: src, url: "https://ir.eia.gov/ngs/" + src });
  });

  const searchAltValue = "Working Gas in Underground Storage Compared with Five-Year Range";
  const searchSrcValue = "ngs.gif";
  const ar = obj.filter(({alt, src}) => alt == searchAltValue && src == searchSrcValue);
  if (ar.length > 0) {
    SpreadsheetApp.getActiveSheet().insertImage(ar[0].url, 1, 1);
  }
}
  • In this sample script, when the values of src and alt are Working Gas in Underground Storage Compared with Five-Year Range and ngs.gif, respectively, the URL is retrieved and put to the image.
  • If you want to select Working Gas in Underground Storage Compared with Five-Year Range OR ngs.gif, please modify alt == searchAltValue && src == searchSrcValue to alt == searchAltValue || src == searchSrcValue.

Upvotes: 3

Related Questions