Reputation: 43
I am trying to scrape information from a web page for use in a Discord bot. The page is https://scp-wiki.wikidot.com/personnel-and-character-dossier (Agents tab).
I haven't used the Cheerio library much and put together my current code with the help of an article (https://www.scrapingbee.com/blog/node-fetch/#scraping-the-web-with-node-fetch-and-cheerio).
const fs = require('fs');
const cheerio = require('cheerio');
const fetch = (...args) => import('node-fetch').then(({default: fetch}) => fetch(...args));

async function test() {
  const response = await fetch('https://scp-wiki.wikidot.com/personnel-and-character-dossier');
  const body = await response.text();

  // parse the html text and extract titles
  const $ = cheerio.load(body);
  const titleList = [];

  // wiki-tab-0-2 is id of Agents tab in dossier
  // using CSS selector
  $('#wiki-tab-0-2').each((i, title) => {
    const titleNode = $(title);
    const titleText = titleNode.text();
    titleList.push(titleText);
  });

  console.log(titleList);
}

test();
What I would like to do is split the text into separate array entries, where each entry holds the text between one <hr> element and the next.
I can't figure out how to do that.
Any extra documentation, resources, or advice would be greatly appreciated.
The expected output would have the "Active Foundation Field Agents." text at the first index, all the text regarding Agent Green at the second index, all the text regarding Agent Travis Kazmarek at the third index, and so on:
titleList = ["Active Foundation Field Agents.", "Agent Green", "Agent Travis Kazmarek"]
Upvotes: 0
Views: 148
Reputation: 244
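Based on your edits, and after looking at the page, I think you want to target #wiki-tab-0-2 > p instead of #wiki-tab-0-2. The ID selector matches only the tab container itself, so .each() runs once and pushes all of the tab's text as a single string; the child selector visits each paragraph separately.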
// wiki-tab-0-2 is id of Agents tab in dossier
// using CSS selector
$('#wiki-tab-0-2 > p').each((i, title) => {
  const titleNode = $(title);
  const titleText = titleNode.text();
  titleList.push(titleText);
});
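If an agent's entry spans more than one paragraph, selecting each <p> will split that entry across several indexes. One possible way to get one index per <hr>-delimited section is to walk the tab's children in order and start a new group at every <hr>. The sketch below is only an illustration: groupByHr is a hypothetical helper, the sample data is made up, and it assumes the <p> and <hr> elements are direct, flat children of #wiki-tab-0-2, which I have not verified against the live page.

```javascript
// Hypothetical helper (not part of Cheerio): groups a flat, ordered list of
// sibling nodes into sections, closing the current section at every <hr>.
// Each node is a plain object of the form { name: tagName, text: textContent }.
function groupByHr(nodes) {
  const sections = [];
  let current = [];
  for (const node of nodes) {
    if (node.name === 'hr') {
      // An <hr> ends the current section; skip empty sections.
      if (current.length > 0) sections.push(current.join('\n'));
      current = [];
    } else if (node.text.trim() !== '') {
      current.push(node.text.trim());
    }
  }
  // Flush whatever follows the last <hr>.
  if (current.length > 0) sections.push(current.join('\n'));
  return sections;
}

// Made-up sample standing in for the children of the Agents tab.
const sample = [
  { name: 'p', text: 'Active Foundation Field Agents.' },
  { name: 'hr', text: '' },
  { name: 'p', text: 'Agent Green' },
  { name: 'p', text: 'Placeholder details.' },
  { name: 'hr', text: '' },
  { name: 'p', text: 'Agent Travis Kazmarek' },
];

const sections = groupByHr(sample);
console.log(sections);
```

With Cheerio, the input list could be built from the loaded page with something like $('#wiki-tab-0-2').children().toArray().map(el => ({ name: el.name, text: $(el).text() })), again assuming the flat structure described above.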
Upvotes: 1