How to get visual DOM structure from url in node.js

Question

I am wondering how to get the "visual" DOM structure from a URL in Node.js. When I fetch the HTML content with the request library, the HTML structure is not correct.

const request = require('request');
const jsdom = require("jsdom");
const { JSDOM } = jsdom;

request({ url: 'https://www.washingtonpost.com/news/acts-of-faith/wp/2017/06/30/trump-promised-to-destroy-the-johnson-amendment-congress-is-targeting-it-now/', jar: true }, function (e, r, body) {
  console.log(body);
});

The returned HTML structure is here; the meta tags are not correct:

If I open the website in a web browser, I can see the correct meta tags in the web inspector:

Khauri · Accepted Answer

I might need more clarification on what a "visual" DOM structure is, but as a commenter pointed out, a headless browser like puppeteer is probably the way to go when a website has complex loading behavior.

The advantage here, with puppeteer at least, is that you can navigate to a page and then programmatically wait until some condition is satisfied before continuing. In this case, I chose to wait until the content attribute of one of the meta tags you specified (og:title) is truthy, but depending on your needs you could wait for something else, or even wait for multiple conditions to be true.

You might have to analyze the behavior of the page in question a little more deeply to figure out what you should wait for, but at the very least the following code seems to correctly load the tags in your question.

import puppeteer from 'puppeteer'

(async ()=>{
  const url = 'https://www.washingtonpost.com/news/acts-of-faith/wp/2017/06/30/trump-promised-to-destroy-the-johnson-amendment-congress-is-targeting-it-now/'
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto(url)
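  // wait until the og:title meta tag has a truthy value for its content attribute
  await page.waitForFunction(()=>{
    // optional chaining keeps the predicate from throwing while the tag is still missing
    return document.querySelector('meta[property="og:title"]')?.getAttribute('content')
  })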
  const html = await page.content()
  console.log(html)
  await browser.close()
})()

(pastebin of formatted html result)
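
If you need to wait for more than one tag, the same idea extends to a predicate that checks several conditions at once. This is only a rough sketch: `metaTagsReady` and `waitForMetaTags` are names I made up for illustration, and the list of properties and the timeout are assumptions you would adjust for the page in question.

```javascript
// Runs inside the page: true once every listed <meta property="..."> tag
// exists and has a non-empty content attribute.
function metaTagsReady(properties) {
  return properties.every((property) => {
    const tag = document.querySelector(`meta[property="${property}"]`);
    return Boolean(tag && tag.getAttribute('content'));
  });
}

// Hypothetical helper: `page` is a puppeteer Page that has already navigated.
// Extra arguments after the options object are passed to the predicate.
async function waitForMetaTags(page, properties, timeout = 30000) {
  await page.waitForFunction(metaTagsReady, { timeout }, properties);
}
```

In the script above, you would call something like `await waitForMetaTags(page, ['og:title', 'og:description'])` after `page.goto(url)`.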

Also, since this solution uses puppeteer, I'd recommend not working with the HTML string and instead using the puppeteer API to extract the information you need.
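
As a minimal sketch of what that could look like, assuming you only care about `<meta property="...">` tags: the function below runs in the page via `page.evaluate` and returns a plain object instead of an HTML string. The names `collectMetaTags` and `readMetaTags` are hypothetical, not part of puppeteer.

```javascript
// Runs inside the page: gathers every <meta property="..."> tag into a
// plain { property: content } object that can be sent back to Node.
function collectMetaTags() {
  const result = {};
  for (const tag of document.querySelectorAll('meta[property]')) {
    result[tag.getAttribute('property')] = tag.getAttribute('content');
  }
  return result;
}

// Hypothetical helper: `page` is a puppeteer Page that has finished loading.
async function readMetaTags(page) {
  return page.evaluate(collectMetaTags);
}
```

You could then replace the `page.content()` call with `const meta = await readMetaTags(page)` and read `meta['og:title']` directly.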
