Reputation: 7765
I want to strip some elements and comments from the DOM within Puppeteer. These items do not have identifiable IDs, classes, or attributes that I can select using CSS. However, they can be identified by the strings they contain, and some elements are wrapped in human-readable comments. My attempts so far: CSS has no contains() selector, so I tried to do it with XPath. Any ideas? Thanks.
So, in the following example, I would like to delete the elements between the <!-- DELETE ME ... --> comments, as well as the <!-- DELETE ME ... --> comments at the end:
<html>
<head>
<!-- DELETE ME BEGIN -->
<script>
// delete me
console.log('delete me')
</script>
<!-- DELETE ME END -->
<title>Page Title</title>
</head>
<body>
<!-- DELETE ME BEGIN -->
<style>
body {
/* delete me */
color: red;
}
</style>
<script>
// delete me
console.log('delete me')
</script>
<!-- DELETE ME END-->
<style>
body {
/* keep me */
color: green;
}
</style>
<script>
// keep me
console.log("keep me")
</script>
<p>Keep me</p>
<!-- keep me -->
</body>
</html>
<!-- DELETE ME -->
<!-- DELETE ME TOO -->
Puppeteer/XPath code (just an attempt, does not yet do anything):
const browser = await puppeteer.launch();
const page = await browser.newPage();
// forward browser console output, using the public type()/text() accessors
page.on("console", (log) => console[log.type()](log.text()));
const html = await page.evaluate(() => {
  var evaluator = new XPathEvaluator();
  var result = evaluator.evaluate(
    "//script[contains(.,'delete me')]",
    document,
    null,
    XPathResult.ANY_TYPE
  );
  // only logs the XPathResult object; nothing is removed yet
  console.log(result);
  return document.documentElement.outerHTML;
});
await browser.close();
Upvotes: 1
Views: 672
Reputation: 7765
Note for my future self: here is the full code I wrote incorporating @sam-r's solution, in this case stripping the elements that the Wayback Machine adds to a rendered archive entry:
// remove elements by XPath
const handles = [
  ...await page.$x("//script[contains(.,'__wm')]"),
  ...await page.$x("//script[contains(.,'archive.org')]"),
  ...await page.$x("//style[contains(.,'margin-top:0 !important;\n padding-top:0 !important;\n /*min-width:800px !important;*/')]"),
  ...await page.$x("//comment()[contains(.,'WAYBACK')]"),
  ...await page.$x("//comment()[contains(.,'Wayback')]"),
  ...await page.$x("//comment()[contains(.,'playback timings (ms)')]"),
];
// await each removal in turn; forEach(async ...) would not wait for them
for (const handle of handles) {
  await page.evaluate(el => el.remove(), handle);
}
// remove elements by CSS selector
await page.evaluate(() => {
  [
    document.querySelector('link[href*="/_static/css/banner-styles.css"]'),
    document.querySelector('link[href*="/_static/css/iconochive.css"]'),
    ...document.querySelectorAll("#wm-ipp-base"), // wayback header
    ...document.querySelectorAll('script[src*="wombat.js"]'),
    ...document.querySelectorAll('script[src*="archive.org"]'),
    ...document.querySelectorAll('script[src*="playback.bundle.js"]'),
    ...document.querySelectorAll("#donato"), // wayback donation header
  ].forEach((element) => element?.remove()); // querySelector may return null
});
Upvotes: 0
Reputation: 16450
Your XPath looks correct. Puppeteer provides the page.$x(expression) function to run an XPath expression:
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://storm-bald-meteorology.glitch.me');
let xs = await page.$x("//script[contains(. ,'delete me')]");
console.log(xs.length);
for (let x of xs) {
  let txt = await page.evaluate(el => el.innerText, x);
  console.log(txt);
}
await browser.close();
You can copy and paste this code into the Puppeteer playground to try it. I have also put your HTML on Glitch.
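The snippet above only finds the script elements. For the part of the question about deleting everything between the BEGIN and END comments, a plain DOM walk may be simpler than XPath. The following is a minimal sketch, not a tested Puppeteer recipe: the function name stripMarkedNodes and its marker parameter are made up for illustration, and it assumes each BEGIN comment and its matching END comment are siblings under the same parent, as in the example HTML.

```javascript
// Sketch: remove every comment containing the marker text, plus all sibling
// nodes between a "... BEGIN" comment and the following "... END" comment.
// It only uses childNodes, nodeType, nodeValue and remove(), and it has no
// outer-scope dependencies, so it should be usable with page.evaluate().
// The name stripMarkedNodes and the marker default are hypothetical.
function stripMarkedNodes(root = document, marker = 'DELETE ME') {
  const COMMENT_NODE = 8; // same value as Node.COMMENT_NODE
  let removed = 0;

  function walk(parent) {
    let deleting = false;
    // Copy the child list first, because nodes are removed while iterating.
    for (const node of Array.from(parent.childNodes)) {
      const text = node.nodeType === COMMENT_NODE ? node.nodeValue : '';
      const isMarker = text.includes(marker);
      if (isMarker && /\bBEGIN\b/.test(text)) deleting = true;
      if (deleting || isMarker) {
        node.remove();
        removed += 1;
      } else {
        walk(node); // descend only into nodes that are kept
      }
      if (isMarker && /\bEND\b/.test(text)) deleting = false;
    }
  }

  walk(root);
  return removed; // number of nodes removed
}

// Possible usage inside the Puppeteer script (assumes `page` is already open):
//   const removed = await page.evaluate(stripMarkedNodes);
//   const html = await page.content();
```

Whitespace text nodes inside a BEGIN/END range are removed along with the elements, and comments that do not contain the marker text, such as <!-- keep me -->, are left alone because the match is case-sensitive.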
Upvotes: 1