Extracting table values from a URL with Node.js

I am quite new to Node.js and Express, but I am trying to build a website which serves static files. After some research I found that Node.js with Express can be quite useful for this. So far I have managed to serve some static HTML files located on my server, but now I want to do something else: I have a URL to an HTML page, and that page contains a table with some information.

I want to extract a couple of specific values from it and 1) save them as JSON in a file, and 2) write those values into an HTML page. I've tried to play with jQuery, but so far I've been unsuccessful.

This is what I have so far:

1. A Node app running on port 8081, which I will later make accessible from anywhere through an NGINX reverse proxy (I already have NGINX set up and it works).

2. I can fetch the URL and serve its content as HTML when I request the proper URI.

3. I see that the table doesn't have an ID, only the "details" class. Also, I am only interested in getting these rows:

<div class='group'>
<table class='details'>
<tr>
<th>Status:</th>
<td>
With editors
</td>
</tr>

From what I've seen so far, jQuery would work fine if the table had an ID.

This is my code in app.js:


var express = require('express');
var app = express();
var request = require('request');
const path = require('path');

var content;

app.use('/', function(req, res, next) {
  var status = 'It works';
  console.log('This is very %s', status);
  //console.log(content);
  next();
});

request(
  {
    uri:
      'https://authors.aps.org/Submissions/status?utf8=%E2%9C%93&accode=CH10674&author=Poenaru&commit=Submit'
  },
  function(error, response, body) {
    content = body;
  }
);

app.get('/', function(req, res) {
  console.log('Got a GET request for the homepage');
  res.sendFile(path.join(__dirname, '/', 'index.html'));
});

app.get('/url', function(req, res) {
  console.log('You requested table data!!!');

  // TODO: show only the values of that table instead of the whole HTML page

  res.send(content);
});

var server = app.listen(8081, function() {
  var host = server.address().address;
  var port = server.address().port;
  console.log('Node-App listening at http://%s:%s', host, port);
});

Basically, the HTML content of that URL is saved into the content variable, and now I would like to keep only the table from it and output only that part to the new HTML page.

Any ideas? Thank you in advance :)

Upvotes: 0

Answers (2)

Robert Poenaru

Reputation: 301

OK, so I've come across a package called cheerio, which basically lets you use jQuery-style selectors on the server. With the HTML from that specific URL, I could search the table for the elements I need. Cheerio is quite straightforward, and with this code I got the results I needed:

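var cheerio = require('cheerio');
var request = require('request'); // already required in app.js; repeated so the snippet stands alone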
request(
  'https://authors.aps.org/Submissions/status?utf8=%E2%9C%93&accode=CH10674&author=Poenaru&commit=Submit',
  (error, res, html) => {
    if (!error && res.statusCode === 200) {
      const $ = cheerio.load(html);
      const details = $('.details');
      const articleInfo = details.find('th').eq(0);
      const articleStatus = details
        .find('th')
        .next()
        .eq(0);
      //console.log(details.html());
      console.log(articleInfo.html());
      console.log(articleStatus.html());
    }
  }
);

Thank you @O.Jones and @avcS for pointing me to jsdom and node-html-parser. I will definitely play with those in the near future :)

Cheers!

Upvotes: 2

O. Jones

Reputation: 108806

Your task is called "scraping." You want to scrape a particular chunk of data from some web page you did not create and then return it as part of your own web page.

You have noticed a problem with scraping: often the page you're scraping does not cleanly identify the data you want with a distinctive id. So you must use some guesswork to find it. @AvcS pointed out a server-side npm library called jsdom you can use for this purpose.

Notice this: even though browsers and Node.js both run JavaScript, they are still very different environments. Browser JavaScript has lots of built-in APIs to access web pages' Document Object Models (DOMs), but Node.js doesn't have those APIs. If you try to load jQuery into Node.js, it won't work, because it depends on browser DOM APIs. The jsdom package gives you some of those DOM APIs.
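
A quick way to see the difference for yourself is the small check below. It is just an illustrative sketch, meant to be run with plain Node.js (no jsdom or similar loaded); the variable names are made up.

```javascript
// In a browser these globals exist; in plain Node.js they are not defined.
// `typeof` is used so the check itself never throws a ReferenceError.
const hasDocument = typeof document !== 'undefined';
const hasWindow = typeof window !== 'undefined';

console.log('document available:', hasDocument);
console.log('window available:', hasWindow);
```

Run under plain Node.js, both flags come out `false`, which is why DOM-dependent code needs a package such as jsdom to supply those objects.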

Once you have fetched that web page to scrape, code like this may help you get what you need.

const jsdom = require("jsdom");
const { JSDOM } = jsdom;
...
const page = new JSDOM(page_in_text_string).window;

Then you can use a subset of the DOM APIs to find the elements you want in your page. In your example, the table has the class details and sits inside a div with the class group, so the selector is div.group table.details. The element whose text you want is that enclosing div.group.

You can do this sort of thing to find what you need:

// Find the table by class, then step up to the div that contains it.
const desiredTbl = page.document.querySelector("div.group table.details");
const desiredDiv = desiredTbl ? desiredTbl.parentNode : null;
const result = desiredDiv ? desiredDiv.textContent : null;

Finally do this:

page.close();

Your question says you want certain rows from your document. HTML documents don't have rows as such; they have elements. If you want to extract just parts of elements (part of the table rather than the whole thing), you'll need to use some text-string code. Just sayin'.
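
For what it's worth, here is one rough idea of what such text-string code could look like. It is a sketch only: it uses nothing but built-in JavaScript, the helper name `extractPairs` is made up, and regex matching on HTML is brittle, so it assumes simple, predictable markup like the fragment in the question.

```javascript
// Sample markup standing in for the fetched page text.
const sampleHtml = `
<table class='details'>
<tr>
<th>Status:</th>
<td>
With editors
</td>
</tr>
</table>`;

// Hypothetical helper: pull each <th>...</th> followed by <td>...</td>
// out of the raw HTML string and return them as label/value pairs.
function extractPairs(html) {
  const pairs = {};
  // \b keeps "<th" from also matching "<thead"; [\s\S] spans line breaks.
  const cellPattern = /<th\b[^>]*>([\s\S]*?)<\/th>\s*<td\b[^>]*>([\s\S]*?)<\/td>/gi;
  for (const match of html.matchAll(cellPattern)) {
    // Remove any nested tags, then trim surrounding whitespace.
    const label = match[1].replace(/<[^>]+>/g, '').trim();
    const value = match[2].replace(/<[^>]+>/g, '').trim();
    pairs[label] = value;
  }
  return pairs;
}

console.log(extractPairs(sampleHtml));
```

If the page layout changes, a pattern like this can silently stop matching, so a DOM-based lookup is usually the safer route when one is available.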

Also, I have not debugged any of this. That is left to you.

There's a smaller and faster library to do similar things called node-html-parser. If performance is important you may want that one instead.

Upvotes: 1
