Reputation: 8942
I am wondering how to get the "visual" DOM structure from a URL in Node.js. When I fetch the HTML content with the request library, the HTML structure is not correct.
const request = require('request');
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
request({ url: 'https://www.washingtonpost.com/news/acts-of-faith/wp/2017/06/30/trump-promised-to-destroy-the-johnson-amendment-congress-is-targeting-it-now/', jar: true }, function (e, r, body) {
  console.log(body);
});
The returned HTML is below; the meta tags have empty content attributes:
<meta property="og:title" content=""/>
<meta itemprop="description" name="description" content=""/>
If I open the website in a web browser, I can see the correct meta tags in the web inspector:
<meta property="og:title" content="Trump promised to destroy the Johnson Amendment. Congress is targeting it now."/>
<meta itemprop="description" name="description" content="Observers believe the proposed legislation would make it harder for the IRS to enforce a law preventing pulpit endorsements."/>
Upvotes: 2
Views: 1438
Reputation: 3863
I might need more clarification on what a "visual" DOM structure is, but as a commenter pointed out, a headless browser (for example, via puppeteer) is probably the way to go when a website has complex loading behavior.
The advantage here, with puppeteer at least, is that you can navigate to a page and then programmatically wait until some condition is satisfied before continuing. In this case, I chose to wait until the content attribute of one of the meta tags you specified is truthy, but depending on your needs you could wait for something else, or even wait for multiple conditions to be true.
You might have to analyze the behavior of the page in question a little more deeply to figure out what to wait for, but at the very least the following code seems to correctly load the tags in your question.
import puppeteer from 'puppeteer'

;(async () => {
  const url = 'https://www.washingtonpost.com/news/acts-of-faith/wp/2017/06/30/trump-promised-to-destroy-the-johnson-amendment-congress-is-targeting-it-now/'
  const browser = await puppeteer.launch()
  try {
    const page = await browser.newPage()
    await page.goto(url)
    // wait until <meta property="og:title"> exists and has a truthy content attribute
    await page.waitForFunction(() => {
      const meta = document.querySelector('meta[property="og:title"]')
      return meta && meta.getAttribute('content')
    })
    const html = await page.content()
    console.log(html)
  } finally {
    // close the browser even if navigation or the wait fails
    await browser.close()
  }
})()
(pastebin of formatted html result)
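If you need to wait for several tags rather than one, a rough sketch could look like the following. The helper name allMetaFilled and the 15 second timeout are my own assumptions, not part of puppeteer; the only puppeteer call involved is page.waitForFunction, which accepts a function, an options object, and extra arguments that are passed to that function inside the page.

```javascript
// Hypothetical helper (not a Puppeteer API). It is meant to run inside the
// page, so it may only use its arguments and browser globals like `document`.
function allMetaFilled(selectors) {
  return selectors.every((selector) => {
    const el = document.querySelector(selector);
    // true only if the element exists and its content attribute is non-empty
    return Boolean(el && el.getAttribute('content'));
  });
}

// Usage with the `page` object from the snippet above (assumed timeout value):
// await page.waitForFunction(
//   allMetaFilled,
//   { timeout: 15000 },
//   ['meta[property="og:title"]', 'meta[name="description"]']
// );
```

The selectors are passed as an argument because the function is serialized and evaluated in the browser, so it cannot see variables from the Node.js scope.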
Also, since this solution uses puppeteer, I'd recommend not working with the HTML string and instead using the puppeteer API to extract the information you need.
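As a minimal sketch of what that could look like, assuming you only need the two meta tags from the question: the helper name readMetaContent and the keys title and description are made up for illustration, while page.$eval is the puppeteer method that runs a callback in the browser on the first element matching a selector.

```javascript
// Hypothetical helper (not a Puppeteer API). `page` is a Puppeteer Page and
// `selectors` maps a result key to a CSS selector for a <meta> tag.
async function readMetaContent(page, selectors) {
  const result = {};
  for (const [name, selector] of Object.entries(selectors)) {
    // page.$eval rejects if no element matches the selector
    result[name] = await page.$eval(selector, (el) => el.getAttribute('content'));
  }
  return result;
}

// Usage with the `page` object from the snippet above, after the wait:
// const meta = await readMetaContent(page, {
//   title: 'meta[property="og:title"]',
//   description: 'meta[name="description"]',
// });
// console.log(meta.title, meta.description);
```

This avoids parsing the HTML string a second time, since the values are read straight from the live DOM.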
Upvotes: 3