janedoe

Reputation: 937

DOM traversal with cheerio - How to get all the elements with their corresponding text

I'm using Cheerio, a jQuery-like library for Node that lets you parse HTML text and traverse it just as you would with jQuery. I need to get the plain text of the HTML body, along with the element each piece of text came from and that element's number. For example, if the text was found in the third paragraph element, I would have something like:

{
    text: <element plaintext>,
    element: "p-3"
}

I currently have the following function that attempts to do this:

var plaintext_elements = traverse_tree($('body'));    

function traverse_tree(root, found_elements = {}, return_array = []) {
    if (root.children().length) {
        //root has children, call traverse_tree on that subtree
        traverse_tree(root.children().first(), found_elements, return_array);
    }
    root.nextAll().each(function(i, elem) {
        if ($(elem).children().length) {
            //if the element has children call traverse_tree on the element's first child
            traverse_tree($(elem).children().first(), found_elements, return_array)
        }
        else {
            if (!found_elements[$(elem)[0].name]) {
                found_elements[$(elem)[0].name] = 1;
            }
            else {
                found_elements[$(elem)[0].name]++
            }
            if ($(elem).text() && $(elem).text() != '') {
                return_array.push({
                    text: $(elem).text(),
                    element: $(elem)[0].name + '-' + found_elements[$(elem)[0].name]
                })
            }
        }
    })


    if (root[0].name == 'body') {
        return return_array;
    }

}

Am I going in the right direction, or should I attempt something else? Any help would be appreciated. Again, this is Cheerio on the server side, not jQuery (they are very similar, however).

Upvotes: 3

Views: 3618

Answers (2)

ggorlen

Reputation: 56855

How about something like:

const cheerio = require("cheerio"); // 1.0.0-rc.12

const html = `<!DOCTYPE html>
<html><body>
<div>
  <p>
    foo
    <b>bar</b>
  </p>
  <p>
    baz
    <b>quux</b>
    garply
  </p>
  corge
</div>
</body>
</html>`;

const $ = cheerio.load(html);
const indices = {};
const seen = new Map();
const els = [...$("*")]
  .flatMap(e =>
    [...$(e).contents()].filter(
      e => e.type === "text" && $(e).text().trim()
    )
  )
  .map(e => {
    const text = $(e).text().trim();
    const {parent} = e;
    const {name: element} = parent;

    if (!seen.has(parent)) {
      indices[element] = ++indices[element] || 0;
      seen.set(parent, indices[element]);
    }

    return {text, element, nth: seen.get(parent)};
  });
console.log(els);

Output:

[
  { text: 'corge', element: 'div', nth: 0 },
  { text: 'foo', element: 'p', nth: 0 },
  { text: 'bar', element: 'b', nth: 0 },
  { text: 'baz', element: 'p', nth: 1 },
  { text: 'garply', element: 'p', nth: 1 },
  { text: 'quux', element: 'b', nth: 1 }
]

This uses .contents() and filters out any non-text nodes and whitespace-only text nodes. Each text node's parent property gives access to the element that contains it.

I'm not entirely sure what your numbering requirement is, but since "*" returns elements in document order, we can track each parent element along with the index it was given the first time we encountered it, and reuse that index for any later text nodes that share the same parent.
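
If the goal is the "p-3" style label from the question, the same bookkeeping can be sketched without cheerio. This is only an illustration under assumptions: the el and text helpers and the labelTextNodes function are hypothetical names, and the plain objects merely mimic the type, name, parent and data fields of parsed nodes. Numbering here is 1-based and counts only elements that directly contain non-whitespace text.

```javascript
// Hypothetical stand-in for a parsed document: plain objects that mimic the
// node fields used above (type, name, parent, data). This is not cheerio's API.
const el = (name, ...children) => {
  const node = {type: "tag", name, children};
  children.forEach(child => (child.parent = node));
  return node;
};
const text = data => ({type: "text", data});

const body = el(
  "body",
  el("p", text("foo"), el("b", text("bar"))),
  el("p", text("baz"), el("b", text("quux")), text("garply"))
);

// Depth-first walk that yields text nodes in document order.
function* textNodes(node) {
  for (const child of node.children ?? []) {
    if (child.type === "text") yield child;
    else yield* textNodes(child);
  }
}

// Give each parent element a per-tag number the first time one of its
// text nodes is seen, then reuse that number for its later text nodes.
function labelTextNodes(root) {
  const counts = {};
  const seen = new Map();
  return [...textNodes(root)]
    .filter(node => node.data.trim())
    .map(node => {
      const {parent} = node;
      if (!seen.has(parent)) {
        counts[parent.name] = (counts[parent.name] ?? 0) + 1;
        seen.set(parent, counts[parent.name]);
      }
      return {
        text: node.data.trim(),
        element: `${parent.name}-${seen.get(parent)}`,
      };
    });
}

const labeled = labelTextNodes(body);
console.log(labeled);
```

Because the Map is keyed by the parent object itself, "baz" and "garply" share one label even though "quux" sits between them.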

Upvotes: 0

Matt

Reputation: 74620

I think a lot of the traversal is not needed if you use the * CSS selector:

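const cheerio = require("cheerio")

function textElements($){
  const found = {} // running count per tag name
  // cheerio's .map() binds `this` to the current element
  return $('body *').map(function(){
    // skip elements that have child elements or no text
    if ( $(this).children().length || $(this).text() === '' ) return
    found[this.name] = found[this.name] ? 1 + found[this.name] : 1
    return {
      text: $(this).text(),
      element: `${this.name}-${found[this.name]}`,
    }
  }).get()
}

// html is the document string to parse
textElements(cheerio.load(html))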

Upvotes: 0
