How to extract text from between any given pair of spans?

Question

I am trying to use Cheerio and Node.js to extract text from an interesting bit of HTML.

Let's say I have the following HTML:


  1
  Do you see that shelf?
  
  2The shelf is hanging

on the wall
beside the clock.
Title Here


  3The clock

was ticking slowly
telling time

I want to be able to extract the following data, getting the text between each pair of span.sectionno and span.endsection:

[
  {
    no: 1,
    text: "Do you see that shelf?",
  },
  {
    no: 2,
    text: "The shelf is hanging on the wall, beside the clock.",
  },
  {
    no: 3,
    text: "The clock was ticking slowly telling time",
  },
]

Notice that I want to ignore any text in headings. I tried things like this, but it just gives me the numbers at the beginning of each section:

const $ = cheerio.load(html);
const sections = [];

$("span.sectionno").each((_, el) => {
  const sectionNo = parseInt($(el).text());
  const text = $(el).nextUntil("span.endsection").addBack().text();
  sections.push({ no: sectionNo, text: text.trim() });
});

console.log(sections);
// [ { no: 1, text: '1' }, { no: 2, text: '2' }, { no: 3, text: '3' } ]

Because of the strange setup of the HTML I have been unable to successfully do this with Cheerio.

Peter Seliger · Accepted Answer

Any good generic approach should mainly consist of three steps.

First, one has to parse a document from the provided markup string, e.g. with ...

const doc = new DOMParser()
  .parseFromString(markup, 'text/html');

Then one needs to query all element nodes that carry the sectionno class, e.g. with ...

const sectionStartNodeList = doc.body
  .querySelectorAll('.sectionno');

The main task of aggregating a text-content item for each available section-start node is achieved by a simple tree-walking process.

For each such entry point, one starts by extracting the item count (no) of the text-item object that is to be created and returned. The item's text property value then gets aggregated by proceeding with the nextSibling of the currently processed node (either a text node or an element node). In case there is neither a next sibling nor an immediate match with an element node that marks a section's end, one has to switch to the next sibling of this last node's parentNode. That's all that is needed for a successful tree walk.
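
The tree walk described above could be sketched as follows. This is a minimal sketch under stated assumptions, not the answer's original implementation: the helper name nextInDocumentOrder, the heading check, and the whitespace normalization are hypothetical choices made for illustration. The code relies only on standard DOM node properties (nodeType, nodeName, nodeValue, firstChild, nextSibling, parentNode, classList), and it also descends into child nodes so that an endsection marker nested inside a later element is still found.

```javascript
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical helper: returns the next node in document order.
// With `skipChildren` set, the subtree of `node` is not entered.
function nextInDocumentOrder(node, skipChildren = false) {
  if (!skipChildren && node.firstChild) {
    return node.firstChild;
  }
  while (node) {
    if (node.nextSibling) {
      return node.nextSibling;
    }
    // No next sibling: climb until some ancestor has one.
    node = node.parentNode;
  }
  return null;
}

// Builds one `{ no, text }` item for a single `.sectionno` start node.
function extractSectionTextContent(sectionStartNode) {
  const no = parseInt(sectionStartNode.textContent, 10);
  const parts = [];

  // Skip the start node's own content (the section number).
  let node = nextInDocumentOrder(sectionStartNode, true);

  while (node) {
    let skipChildren = false;

    if (node.nodeType === ELEMENT_NODE) {
      if (node.classList.contains('endsection')) {
        break; // the section's end marker has been reached
      }
      // Assumption: heading text is ignored, as the question asks.
      skipChildren = /^h[1-6]$/i.test(node.nodeName);
    } else if (node.nodeType === TEXT_NODE) {
      parts.push(node.nodeValue);
    }
    node = nextInDocumentOrder(node, skipChildren);
  }
  // Assumption: text fragments are joined by a space and all
  // whitespace runs are collapsed into a single space.
  const text = parts.join(' ').replace(/\s+/g, ' ').trim();

  return { no, text };
}
```

If a section has no closing endsection element, this sketch simply collects text until the end of the document.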

Assuming the function described above has been named extractSectionTextContent, it can be applied directly via a map task which iterates the array form of the previously queried node list ...

const sectionContentList = [...sectionStartNodeList]
  .map(extractSectionTextContent);

... example code ...

const markup = `
  
    1
    Do you see that shelf?
    
    2The shelf is hanging
  
  on the wall
  beside the clock.
  Title Here
  
  
    3The clock
  
  was ticking slowly
  telling time
`;
const docBody = new DOMParser()
  .parseFromString(markup, 'text/html')
  .body;

const sectionStartNodeList = docBody
  .querySelectorAll('.sectionno');

console.log({ sectionStartNodeList: [...sectionStartNodeList] });

const sectionContentList = [...sectionStartNodeList]
  .map(extractSectionTextContent);

console.log({ sectionContentList });


Edit ... regarding the next quoted follow-up comments after having provided the above solution ...

This is nice! But this runs in the browser, I am trying to do this in node and it doesn't seem to quite work with using jsdom instead? – Adam D

@AdamD ... everything provided above runs in node.js too. What you have to look for is a DOMParser like node package/module or make use of e.g. the jsdom package. – Peter Seliger

The jsdom library fails at traversing a DOM-like model in the way that is required for any clean/lean solution to the OP's problem. But ershov-konst's dom-parser package provides some basic DOM-walking capability.

Thus the next provided code can be run in a node.js-environment.

The first introduced approach can be kept entirely. Just some implementation details have to be changed slightly in order to reflect the model-differences which are introduced by the dom-parser library.

This library, for instance, does not support a DOM node's nextSibling property. Thus, one has to implement and use one's own getNextSibling function, which works on the childNodes array of any node's parentNode; both are properties that dom-parser supports.
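
A getNextSibling replacement along those lines might look like the sketch below. It assumes, as stated above, that every node exposes parentNode and that childNodes is an array-like list holding the very same node objects that are being walked; the exact behavior of the dom-parser package has not been verified here.

```javascript
// Hypothetical stand-in for the missing `nextSibling` property.
// Assumes `node.parentNode.childNodes` is array-like and contains the
// same object references as the nodes being walked.
function getNextSibling(node) {
  const siblings = Array.from(node?.parentNode?.childNodes ?? []);
  const index = siblings.indexOf(node);

  // No parent, or the node was not found among its parent's children.
  if (index === -1) {
    return null;
  }
  // `null` mirrors what the DOM's `nextSibling` yields for a last child.
  return siblings[index + 1] ?? null;
}
```

Inside the tree walk, every read of node.nextSibling would then be replaced by a call to getNextSibling(node).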

... example code, capable of being executed within a node.js environment ...

const markup = `
  
    1
    Do you see that shelf?
    
    2The shelf is hanging
  
  on the wall
  beside the clock.
  Title Here
  
  
    3The clock
  
  was ticking slowly
  telling time
`;
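// Assumption: `domParser` is a parser object provided by the dom-parser
// package, and `extractSectionTextContent` is the tree-walking function
// described above, adapted to use `getNextSibling`.
function main(markup) {

  const domParserRoot = domParser
    .parseFromString(`${ markup }`);

  const sectionStartNodeList = domParserRoot
    .getElementsByClassName('sectionno');

  console.log({ sectionStartNodeList });

  const sectionContentList = [...sectionStartNodeList]
    .map(extractSectionTextContent);

  console.log({ sectionContentList });
}
// Node.js has no global `document`, so `main` is invoked directly.
main(markup);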

