Reputation: 21150
I'm building a scraper in Node.js and have come up against a slight problem. I'm trying to build a function which gets an element's text, regardless of whether it's embedded in a <p> tag, in a <span> or just a <div> with text inside.
The following currently works ONLY for text contained in <p> tags:
function getDescription(product){
    var text = [];
    $('.description *').each(function(i, elem) {
        var dirty = $(this).text();
        var clean = sanitize(dirty).trim();
        if (clean.length){
            text.push(clean);
        }
    });
    text.join(',');
    sanitize(text).trim();
    return text;
}
This works for code like this:
<div class="description">
<p>Test test test</p>
</div>
But doesn't work for this:
<div class="description">
Test test test
</div>
For reference, the sanitize and trim functions are part of Node Validator, but that's not particularly relevant to my problem - they just take a string and remove whitespace from it.
Any ideas on what I can do to make the one function work for BOTH instances? To add insult to injury, I'm slightly more limited because I'm using the cheerio library in Node, which replicates some of jQuery's functions, but not all of them.
Upvotes: 1
Views: 235
Reputation: 16615
You can use innerText:
var text = [];
$('.description').each(function(i, elem) {
    var dirty = elem.innerText;
    var clean = sanitize(dirty).trim();
    if (clean.length){
        text.push(clean);
    }
});
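Note that innerText is a browser DOM property, so it may be undefined on the nodes cheerio passes to .each(); there, $(elem).text() is the usual equivalent. Below is a dependency-free sketch of a fallback. It assumes parser nodes shaped like { type: 'text', data } and { type: 'tag', children }, and the helper name readText is hypothetical, not part of cheerio:

```javascript
// Hypothetical helper: read a node's text whether or not `innerText` exists.
// Assumes parser nodes shaped like { type: 'text', data } or { type: 'tag', children }.
function readText(node) {
  if (typeof node.innerText === 'string') {
    return node.innerText; // browser DOM element
  }
  if (node.type === 'text') {
    return node.data; // bare text node
  }
  // Element: concatenate the text of all children, recursively.
  return (node.children || []).map(readText).join('');
}

// <div class="description"><p>Test test test</p></div>
const withP = {
  type: 'tag',
  children: [{ type: 'tag', children: [{ type: 'text', data: 'Test test test' }] }],
};

// <div class="description">Test test test</div>
const bare = { type: 'tag', children: [{ type: 'text', data: 'Test test test' }] };

console.log(readText(withP)); // Test test test
console.log(readText(bare));  // Test test test
```

Because the text is read from the container itself, both markup variants yield the same string.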
Upvotes: 0
Reputation: 388316
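Use .contents() instead of *. The * selector matches only descendant elements, while .contents() returns the child nodes, including bare text nodes:
function getDescription(product){
    var text = [];
    // .contents() yields every child node of .description, text nodes included
    $('.description').contents().each(function(i, elem) {
        var dirty = $(this).text();
        var clean = sanitize(dirty).trim();
        if (clean.length){
            text.push(clean);
        }
    });
    // join() returns a new string, so return its result instead of discarding it
    return text.join(',');
}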
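To see why this makes a difference, here is a minimal dependency-free sketch that models the parsed tree with plain objects. The node shape and the helper names (descendantElements, childNodes, textOf, collect) are hypothetical stand-ins for illustration, not cheerio's API:

```javascript
// Hypothetical plain-object model of a parsed tree; not cheerio's real node type.
const text = (data) => ({ type: 'text', data });
const tag = (name, ...children) => ({ type: 'tag', name, children });

// Roughly what `.text()` does: concatenate all text beneath a node.
function textOf(node) {
  return node.type === 'text' ? node.data : node.children.map(textOf).join('');
}

// Roughly what the `*` selector matches: descendant elements only.
function descendantElements(node) {
  const out = [];
  for (const child of node.children) {
    if (child.type === 'tag') {
      out.push(child, ...descendantElements(child));
    }
  }
  return out;
}

// Roughly what `.contents()` returns: direct children, text nodes included.
function childNodes(node) {
  return node.children;
}

// Collect the trimmed, non-empty text of each node in a list.
function collect(nodes) {
  return nodes.map((n) => textOf(n).trim()).filter((s) => s.length > 0);
}

const withP = tag('div', text('\n'), tag('p', text('Test test test')), text('\n'));
const bare = tag('div', text('\nTest test test\n'));

console.log(collect(descendantElements(withP))); // [ 'Test test test' ]
console.log(collect(descendantElements(bare)));  // []
console.log(collect(childNodes(withP)));         // [ 'Test test test' ]
console.log(collect(childNodes(bare)));          // [ 'Test test test' ]
```

The bare-text case has no descendant elements at all, so the element-only traversal finds nothing, while the child-node traversal picks up the text node directly.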
Upvotes: 6