Just A Question

Reputation: 540

How to get a text that's separated by different HTML tags in Cheerio

I'm trying to scrape the specific text strings from the HTML below and get them as separate values, e.g.:

let text = "That's the first text I need";
let text2 = "The second text I need";
let text3 = "The third text I need";

I really don't know how to get text that's separated by different HTML tags.

<p>
   <span class="hidden-text"><span class="ft-semi">Count:</span>31<br></span>
   <span class="ft-semi">Something:</span> That's the first text I need
   <span class="hidden-text"><span class="ft-semi">Something2:</span> </span>The second text I need
   <br><span class="ft-semi">Something3:</span> The third text I need
</p>

Upvotes: 4

Views: 1416

Answers (3)

seventeen

Reputation: 443

If you need all descendants, no matter how deep, and they may be elements that contain text rather than text nodes themselves, try:

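// Assumed here for illustration; use whatever selector matches your container.
const selector = "p";

const text = [...$(selector).find("*")] // every descendant element
  // skip direct children of the container and elements with no text
  .filter((e) => !$(e).parent().is(selector) && $(e).text().length > 0)
  .map((e) => $(e).text())
  // drop duplicate strings, keeping the first occurrence
  .filter((value, index, self) => self.indexOf(value) === index)
  .join(" ");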
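
The last two steps of that chain are plain array operations: the `indexOf` filter keeps only the first occurrence of each string, and `join(" ")` merges what is left into a single string. A small sketch with made-up input (the `texts` array is hypothetical, not output from a real page):

```javascript
// Hypothetical strings, standing in for what .map((e) => $(e).text()) might return.
const texts = ["Count:", "Something:", "Count:", "Something2:"];

// Keep a value only if this index is its first position in the array.
const unique = texts.filter((value, index, self) => self.indexOf(value) === index);

// Merge the remaining strings into one, separated by spaces.
const joined = unique.join(" ");
console.log(joined); // Count: Something: Something2:
```

Note that this produces one combined string rather than separate values.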

Upvotes: 0

ggorlen

Reputation: 57394

You can iterate the child nodes of the <p> and keep the text nodes (nodeType === Node.TEXT_NODE) that have nonempty content. In the browser:

for (const e of document.querySelector("p").childNodes) {
  if (e.nodeType === Node.TEXT_NODE && e.textContent.trim()) {
    console.log(e.textContent.trim());
  }
}

// or to make an array:
const result = [...document.querySelector("p").childNodes]
  .filter(e =>
    e.nodeType === Node.TEXT_NODE && e.textContent.trim()
  )
  .map(e => e.textContent.trim());
console.log(result);

<p>
  <span class="hidden-text">
    <span class="ft-semi">Count:</span>
    31
    <br>
  </span>
  <span class="ft-semi">Something:</span>
  That's the first text I need
  <span class="hidden-text">
    <span class="ft-semi">Something2:</span>
  </span>
  The second text I need
  <br>
  <span class="ft-semi">Something3:</span>
  The third text I need
</p>

In Cheerio:

const cheerio = require("cheerio"); // 1.0.0-rc.12

const html = `
<p>
  <span class="hidden-text">
    <span class="ft-semi">Count:</span>
    31
    <br>
  </span>
  <span class="ft-semi">Something:</span>
  That's the first text I need
  <span class="hidden-text">
    <span class="ft-semi">Something2:</span>
  </span>
  The second text I need
  <br>
  <span class="ft-semi">Something3:</span>
  The third text I need
</p>
`;

const $ = cheerio.load(html);
const result = [...$("p").contents()]
  .filter(e => e.type === "text" && $(e).text().trim())
  .map(e => $(e).text().trim());

console.log(result);
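
To end up with the three separate variables from the question, the array can be destructured. A minimal sketch, assuming `result` holds exactly the three strings the Cheerio code above produces for the sample HTML (it is hard-coded here so the example runs on its own):

```javascript
// Assumption: this is the array the Cheerio code produces for the sample HTML;
// hard-coded so the snippet is self-contained.
const result = [
  "That's the first text I need",
  "The second text I need",
  "The third text I need",
];

// Array destructuring assigns by position.
const [text, text2, text3] = result;

console.log(text);
console.log(text2);
console.log(text3);
```

If a page might have more or fewer text nodes, check `result.length` first; positions with no matching element come back as `undefined`.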

Upvotes: 4

Jack Fleeting

Reputation: 24940

Try something like this, which uses the browser's DOMParser and an XPath query to select text nodes that are not inside a <span>:

const html = `your sample html above`;

const domdoc = new DOMParser().parseFromString(html, "text/html");
// Select every text node that has no <span> ancestor.
const result = domdoc.evaluate(
  "//text()[not(ancestor::span)]",
  domdoc,
  null,
  XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
  null
);

for (let i = 0; i < result.snapshotLength; i++) {
  const target = result.snapshotItem(i).textContent.trim();
  // Skip whitespace-only text nodes.
  if (target.length > 0) {
    console.log(target);
  }
}

Using your sample html, the output should be:

"That's the first text I need"
"The second text I need"
"The third text I need"

Upvotes: 0
