Adam D

Reputation: 2247

How to extract text from between any given pair of spans?

I am trying to use Cheerio and Node.js to extract text from an interesting bit of HTML.

Let's say I have the following HTML:

<p>
  <span class="sectionno" id="s1">1</span>
  Do you see that shelf?
  <span class="endsection"></span>
  <span class="sectionno" id="s2">2</span>The shelf is hanging
</p>
<p>on the wall</p>
<p>beside the clock.</p>
<h3>Title Here</h3>
<span class="endsection"></span>
<p>
  <span class="sectionno" id="s3">3</span>The clock
</p>
<p>was ticking slowly</p>
<p>telling time<span class="endsection"></span></p>

I want to be able to extract the following data, getting the text between each pair of span.sectionno and span.endsection:

[
  {
    no: 1,
    text: "Do you see that shelf?",
  },
  {
    no: 2,
    text: "The shelf is hanging on the wall, beside the clock.",
  },
  {
    no: 3,
    text: "The clock was ticking slowly telling time",
  },
]

Notice that I want to ignore any text in headings. I tried things like this, but it just gives me the numbers at the beginning of each section:

const $ = cheerio.load(html);
const sections = [];

$("span.sectionno").each((_, el) => {
  const sectionNo = parseInt($(el).text());
  const text = $(el).nextUntil("span.endsection").addBack().text();
  sections.push({ no: sectionNo, text: text.trim() });
});

console.log(sections);
// [ { no: 1, text: '1' }, { no: 2, text: '2' }, { no: 3, text: '3' } ]

Because of the strange setup of the HTML I have been unable to successfully do this with Cheerio.

Upvotes: 2

Views: 146

Answers (2)

Peter Seliger

Reputation: 13432

Any good generic approach should consist of mainly 3 steps.

First, one has to parse a document from the provided markup string, e.g. ...

const doc = new DOMParser()
  .parseFromString(markup, 'text/html');

Then one needs to query all element nodes that carry the sectionno class, e.g. ...

const sectionStartNodeList = doc.body
  .querySelectorAll('.sectionno');

The main task of aggregating a text-content item for each available section-start node gets achieved by a simple tree-walking process.

For each such entry point, one first extracts the item count (no) of the text-item object that is to be created and returned. The item's text property value is then aggregated by proceeding to the nextSibling of the currently processed node (either a text node or an element node). If there is no next sibling, one switches to the next sibling of that node's parentNode; the walk stops as soon as it reaches an element node that marks a section's end. That's all that is needed for a successful tree walk.

In case the above described function has been named extractSectionTextContent, it can be applied directly via a map task which iterates the array-form of the before queried node-list ...

const sectionContentList = [...sectionStartNodeList]
  .map(extractSectionTextContent);

... example code ...

const markup = `
  <p>
    <span class="sectionno" id="s1">1</span>
    Do you see that shelf?
    <span class="endsection"></span>
    <span class="sectionno" id="s2">2</span>The shelf is hanging
  </p>
  <p>on the wall</p>
  <p>beside the clock.</p>
  <h3>Title Here</h3>
  <span class="endsection"></span>
  <p>
    <span class="sectionno" id="s3">3</span>The clock
  </p>
  <p>was ticking slowly</p>
  <p>telling time<span class="endsection"></span></p>
`;
const docBody = new DOMParser()
  .parseFromString(markup, 'text/html')
  .body;

const sectionStartNodeList = docBody
  .querySelectorAll('.sectionno');

console.log({ sectionStartNodeList: [...sectionStartNodeList] });

const sectionContentList = [...sectionStartNodeList]
  .map(extractSectionTextContent);

console.log({ sectionContentList });
<script>
function extractSectionTextContent(node) {

  const contentList = [];
  const textItemCount = node.textContent.trim();

  let textValue;

  while (
    (node = node.nextSibling || node.parentNode.nextSibling) &&
    !node.classList?.contains('endsection')
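  ) {
    // reset per iteration, otherwise a skipped node (e.g. a heading)
    // would push the previous iteration's value a second time.
    textValue = '';

    if (node.nodeType === Node.TEXT_NODE) {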

      textValue = node.nodeValue.trim();

    } else if (
      (node.nodeType === Node.ELEMENT_NODE) &&

      // OP ... "Notice that I want to ignore any text in headings."
      !/^h[1-6]$/.test(node.tagName.toLowerCase())
    ) {

      textValue = node.textContent.trim();
    }
    if (textValue) {
      contentList.push(textValue);
    }
  }

  return {
    no: textItemCount,
    text: contentList.join(' '), 
  };
}
</script>

Edit ... regarding the next quoted follow-up comments after having provided the above solution ...

This is nice! But this runs in the browser, I am trying to do this in node and it doesn't seem to quite work with using jsdom instead? – Adam D

@AdamD ... everything provided above runs in node.js too. What you have to look for is a DOMParser like node package/module or make use of e.g. the jsdom package. – Peter Seliger

The jsdom library fails at traversing a DOM-like model as it is required for any c/lean solution to the OP's problem. But ershov-konst's dom-parser package provides some basic dom-walking capability.

Thus the next provided code can be run in a node.js-environment.

The first introduced approach can be kept entirely. Just some implementation details have to be changed slightly in order to reflect the model-differences which are introduced by the dom-parser library.

This library, for instance, does not support a DOM node's nextSibling property. Thus, one has to implement and use a custom getNextSibling function that works on the childNodes array of any node's parentNode; both properties are supported by dom-parser.
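A possible shape for such a helper, sketched on plain objects that only assume the `parentNode` and `childNodes` properties mentioned above. `makeNode` and `getNextInWalk` are hypothetical names invented for this example, not dom-parser API. `getNextInWalk` additionally climbs more than one ancestor level, which a walk would need once a section start is nested deeper than one level.

```javascript
// Hypothetical stand-in nodes: only `parentNode` and `childNodes` are assumed.
function makeNode(name, ...childNodes) {
  const node = { name, parentNode: null, childNodes };
  childNodes.forEach((child) => { child.parentNode = node; });
  return node;
}

// Next sibling looked up through the parent's `childNodes` array, or null.
function getNextSibling(node) {
  const siblings = node.parentNode?.childNodes ?? [];
  const index = siblings.indexOf(node);

  // guard against `indexOf` returning -1 for a detached node.
  return index >= 0 ? siblings[index + 1] ?? null : null;
}

// Next sibling of the node itself or, failing that, of its closest
// ancestor that has one; null once the root has been reached.
function getNextInWalk(node) {
  for (let current = node; current; current = current.parentNode) {
    const sibling = getNextSibling(current);
    if (sibling) return sibling;
  }
  return null;
}

// <div><p><em><span/></em></p><p/></div>
const leaf = makeNode('span');
const inner = makeNode('em', leaf);
const first = makeNode('p', inner);
const second = makeNode('p');
const root = makeNode('div', first, second);

console.log(getNextSibling(leaf));       // null, the span is an only child
console.log(getNextInWalk(leaf)?.name);  // 'p', found two levels up
console.log(getNextInWalk(second));      // null, nothing left after the last node
```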

... example code, capable of being executed within a node.js environment ...

const markup = `
  <p>
    <span class="sectionno" id="s1">1</span>
    Do you see that shelf?
    <span class="endsection"></span>
    <span class="sectionno" id="s2">2</span>The shelf is hanging
  </p>
  <p>on the wall</p>
  <p>beside the clock.</p>
  <h3>Title Here</h3>
  <span class="endsection"></span>
  <p>
    <span class="sectionno" id="s3">3</span>The clock
  </p>
  <p>was ticking slowly</p>
  <p>telling time<span class="endsection"></span></p>
`;
function main(markup) {

  const domParserRoot = domParser
    .parseFromString(`<div>${ markup }</div>`);

  const sectionStartNodeList = domParserRoot
    .getElementsByClassName('sectionno');

  console.log({ sectionStartNodeList });

  const sectionContentList = [...sectionStartNodeList]
    .map(extractSectionTextContent);

  console.log({ sectionContentList });
}
document
  .addEventListener('DOMContentLoaded', () => main(markup));
<script type="module">
  import * as domParser from 'https://cdn.jsdelivr.net/npm/[email protected]/+esm';
  
  window.domParser = domParser;
</script>

<script>
function getNextSibling(node) {
  const siblingNodes = node.parentNode?.childNodes ?? [];

  return siblingNodes
    .at(siblingNodes.indexOf(node) + 1) ?? null;
}
function extractSectionTextContent(node) {

  const contentList = [];
  const textItemCount = node.textContent.trim();

  let classAttr;
  let textValue;

  while (
    (node = getNextSibling(node) || getNextSibling(node.parentNode)) &&
    (classAttr = node.attributes.find(({ name }) => name === 'class') ?? {}) &&
    !/\bendsection\b/.test(classAttr.value ?? '')
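  ) {
    // reset per iteration, otherwise a skipped node (e.g. a heading)
    // would push the previous iteration's value a second time.
    textValue = '';

    if (node.nodeType === 3) {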

      textValue = node.text.trim();

    } else if (
      (node.nodeType === 1) &&

      // OP ... "Notice that I want to ignore any text in headings."
      !/^h[1-6]$/.test(node.nodeName)
    ) {

      textValue = node.textContent.trim();
    }
    if (textValue) {
      contentList.push(textValue);
    }
  }

  return {
    no: textItemCount,
    text: contentList.join(' '), 
  };
}
</script>

Upvotes: 2

pguardiario

Reputation: 55002

Untested but I think it looks like:

$('span.sectionno').get().map(span => {
  let no = Number($(span).text())
  let text = span.nextSibling?.data?.trim()
  return { no, text }
})

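This reads the text node that directly follows the span. Cheerio's sibling methods such as next() and nextUntil() only return elements, so the text node has to be reached through the underlying node's nextSibling. Note that it only covers the text right after the span, not the paragraphs that follow.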

Upvotes: 0
