Reputation: 937
So I'm using Cheerio, a library similar to jQuery for the Node server side, that lets you parse HTML text and traverse it just as you would with jQuery. I need to get the plain text of the HTML body, but I also need the corresponding element and its number. For example, if the plain text was found in the third paragraph element, I would have something like:
{
text: <element plaintext>,
element: "p-3"
}
I currently have the following function that attempts to do this:
var plaintext_elements = traverse_tree($('body'));
function traverse_tree(root, found_elements = {}, return_array = []) {
if (root.children().length) {
//root has children, call traverse_tree on that subtree
traverse_tree(root.children().first(), found_elements, return_array);
}
root.nextAll().each(function(i, elem) {
if ($(elem).children().length) {
//if the element has children call traverse_tree on the element's first child
traverse_tree($(elem).children().first(), found_elements, return_array)
}
else {
if (!found_elements[$(elem)[0].name]) {
found_elements[$(elem)[0].name] = 1;
}
else {
found_elements[$(elem)[0].name]++
}
if ($(elem).text() !== '') {
return_array.push({
text: $(elem).text(),
element: $(elem)[0].name + '-' + found_elements[$(elem)[0].name]
})
}
}
})
if (root[0].name == 'body') {
return return_array;
}
}
Am I going in the right direction, or should I attempt something else? Any help on this would be appreciated. Again, this is not jQuery but Cheerio on the server side (they are very similar, however).
Upvotes: 3
Views: 3618
Reputation: 56855
How about something like:
const cheerio = require("cheerio"); // 1.0.0-rc.12
const html = `<!DOCTYPE html>
<html><body>
<div>
<p>
foo
<b>bar</b>
</p>
<p>
baz
<b>quux</b>
garply
</p>
corge
</div>
</body>
</html>`;
const $ = cheerio.load(html);
const indices = {};
const seen = new Map();
const els = [...$("*")]
.flatMap(e =>
[...$(e).contents()].filter(
e => e.type === "text" && $(e).text().trim()
)
)
.map(e => {
const text = $(e).text().trim();
const {parent} = e;
const {name: element} = parent;
if (!seen.has(parent)) {
indices[element] = ++indices[element] || 0;
seen.set(parent, indices[element]);
}
return {text, element, nth: seen.get(parent)};
});
console.log(els);
Output:
[
{ text: 'corge', element: 'div', nth: 0 },
{ text: 'foo', element: 'p', nth: 0 },
{ text: 'bar', element: 'b', nth: 0 },
{ text: 'baz', element: 'p', nth: 1 },
{ text: 'garply', element: 'p', nth: 1 },
{ text: 'quux', element: 'b', nth: 1 }
]
This uses .contents() and filters out any non-text nodes and whitespace-only text nodes. The parent property of each text node gives access to the tag that contains it.
I'm not entirely sure what your numbering requirement is, but since "*" returns elements in document order, we can number each parent element the first time one of its text nodes is encountered and reuse that number for any further text nodes under the same parent. Note that nth is 0-based here.
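If you need the "p-3" style labels from the question, the 0-based nth values can be converted afterwards. This is a minimal sketch, assuming the records have the {text, element, nth} shape printed in the output above; toLabels is a hypothetical helper name, not part of Cheerio:

```javascript
// Hypothetical helper (not part of Cheerio): turns {text, element, nth}
// records into the question's {text, element: "p-3"} shape.
function toLabels(els) {
  // nth is 0-based, so add 1 to get the 1-based label the question asks for.
  return els.map(({text, element, nth}) => ({
    text,
    element: `${element}-${nth + 1}`,
  }));
}

// Sample records copied from the output above.
const sample = [
  {text: "corge", element: "div", nth: 0},
  {text: "baz", element: "p", nth: 1},
  {text: "garply", element: "p", nth: 1},
];

console.log(toLabels(sample));
```

Keeping the label formatting separate from the traversal means you can switch between 0-based and 1-based numbering without touching the Cheerio code.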
Upvotes: 0
Reputation: 74620
I think a lot of the traversal is not needed if you use the * CSS selector:
function textElements($){
const found = {}
return $('body *').map(function(){
if ( $(this).children().length || $(this).text() === '' ) return
found[this.name] = found[this.name] ? 1 + found[this.name] : 1
return {
text: $(this).text(),
element: `${this.name}-${found[this.name]}`,
}
}).get()
}
textElements(cheerio.load(html))
Upvotes: 0