Reputation: 419
I was using curl to scrape HTML from a certain website. Then they changed their server settings and curl could no longer get the page content, failing with error code 1020, so I switched my script to elinks.
But now they are using Cloudflare, and elinks no longer works either (only on this particular website); it gives the same error code 1020.
Is there any command-line option to use other browsers (Firefox, Chromium, Google Chrome, ...) and get the page HTML in a terminal?
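For example, something along these lines would be ideal (I am not sure of the exact flags, so this is just the kind of invocation I am after):
chromium --headless --dump-dom 'https://example.org/' > page.html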
Upvotes: 1
Views: 1129
Reputation: 320
Here are libraries and code that can bypass Cloudflare protection. The stealth plugin masks the usual headless-Chrome fingerprints (for example, navigator.webdriver) that bot checks look for:
Install the libraries:
npm i puppeteer-extra puppeteer-extra-plugin-stealth puppeteer
Node.js script:
const puppeteer = require('puppeteer-extra')
const pluginStealth = require('puppeteer-extra-plugin-stealth')
const { executablePath } = require('puppeteer')

const link = 'https://www.g2.com/'

const getHtmlThroughCloudflare = async (url) => {
  // Register the stealth plugin before launching the browser
  puppeteer.use(pluginStealth())

  const browser = await puppeteer.launch({
    headless: true,
    executablePath: executablePath(), // use the Chromium bundled with puppeteer
  })
  try {
    const page = await browser.newPage()
    await page.goto(url)
    const html = await page.content() // serialized HTML after page scripts have run
    console.log(`HTML: ${html}`)
    return html
  } finally {
    await browser.close()
  }
}

getHtmlThroughCloudflare(link)
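Assuming the script is saved as, say, cf-dump.js (the name is arbitrary), you can run it from a terminal; the HTML goes to stdout, so it can be redirected to a file:
node cf-dump.js > page.html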
Upvotes: 0
Reputation: 13822
If you can write scripts for Node.js, here is a small example using the puppeteer library. It logs the page source after the page has loaded in headless (invisible) Chrome, including dynamic content generated by page scripts:
import puppeteer from 'puppeteer';

// Launch an invisible (headless) Chrome instance
const browser = await puppeteer.launch({ headless: true, defaultViewport: null });
try {
  const [page] = await browser.pages(); // reuse the tab opened at launch
  await page.goto('https://example.org/');
  console.log(await page.content()); // serialized HTML after page scripts have run
} catch (err) {
  console.error(err);
} finally {
  await browser.close();
}
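Since the script uses ES-module import syntax, save it with an .mjs extension (or set "type": "module" in package.json), e.g. as scrape.mjs (name is arbitrary), then run it and redirect the output:
node scrape.mjs > page.html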
Upvotes: 1