Reputation: 75
I am trying to write a scraper in Node.js that gets all the 'p' tags from within a div inside a Facebook post.
Each post on the page lies within a div that has this class: .text_exposed_root
There are sometimes multiple 'p' tags within each post, so ideally I need to grab all of the HTML text within that div if possible. I am using the cheerio and request modules, and my code so far is below:
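var request = require('request');
var cheerio = require('cheerio');

// BTTS holds the URL of the Facebook page being scraped
request(BTTS, function(error, response, body){
  if (!error){
    var $ = cheerio.load(body),
        post = $(".text_exposed_root p").text();
    console.log(post);
  } else {
    console.log("We’ve encountered an error: " + error);
  }
});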
I have tried using .text, .value and .html, but they all return a blank response. I'm guessing I need to grab all the 'p' tags within that div and convert them to a string, maybe?
Thanks in advance.
EDITED:
var url = 'https://www.facebook.com/BothTeamsToScore';
request({url: url, headers: headers}, function(error, response, body){
  if (!error){
    // Remove only the comment delimiters; /<!--[\s\S]*?-->/g would also
    // delete the markup wrapped inside the comments.
    var strippedBody = body.replace(/<!--|-->/g, "");
    console.log(strippedBody);
    var $ = cheerio.load(strippedBody),
        post = $(".text_exposed_root p").text();
    console.log(post);
  } else {
    console.log("We’ve encountered an error: " + error);
  }
});
Upvotes: 1
Views: 3808
Reputation: 36
There is a library called node-html-parser that you can use to parse an HTML string and then perform DOM-like queries on the result.
In your case, you can create a function that queries all divs with the class .text_exposed_root and then extracts the innerHTML of each one.
File: html.utiliy.ts
import parse from "node-html-parser";
export class HTMLUtility {
  /**
   * Returns the inner HTML of every element that has the given class name
   * @param text the HTML text as a string
   * @param className targeted class name
   * @returns a list with the inner HTML found inside each matching element
   */
  public static getAllParagraphsByDivClass(text: string, className: string): string[] {
    const root = parse(text);
    const htmlDivElement = root.querySelectorAll(`.${className}`);
    return htmlDivElement.map((m) => m.innerHTML);
  }
}
File: html.utility.unit.test.ts
import { describe, it } from "mocha";
import { assert } from "chai";
import { HTMLUtility } from "../../../../src/application/helpers/html.utiliy";
describe("HTML Utility", () => {
  describe("get all elements inside div", () => {
    it("given certain html, will return a list of text inside each div.className", async () => {
      // arrange
      const post1 = "<p>Lorem ipsum dolor sit amet</p><p>consectetur adipiscing elit.</p>";
      const post2 = "<p>Aliquam iaculis ornare massa</p><p>ut porta enim mollis ac.</p>";
      const post3 = "<p>Maecenas sodales pretium sollicitudin.</p>";
      const htmlText = `
<body>
<div class='text_exposed_root'>${post1}</div>
<div class='text_exposed_root'>${post2}</div>
<div class='text_exposed_root'>${post3}</div>
</body>`;
      // act
      const paragraphList = HTMLUtility.getAllParagraphsByDivClass(htmlText, "text_exposed_root");
      // assert
      assert.equal(paragraphList[0], post1);
      assert.equal(paragraphList[1], post2);
      assert.equal(paragraphList[2], post3);
    });
  });
});
Upvotes: 1
Reputation: 10722
First of all, you're going to need to set some headers on your request. Without them, Facebook will respond with an "unsupported browser" page. That's your first problem.
var headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36',
  'Content-Type': 'application/x-www-form-urlencoded'
};
var url = BTTS;
request({url: url, headers: headers}, function(error, response, body){
  if (!error){
    var $ = cheerio.load(body.replace(/<!--|-->/g, ''));
    console.log($('.text_exposed_root p').text());
  } else {
    console.log("We’ve encountered an error: " + error);
  }
});
The other thing to note is that the content arrives inside an HTML comment, i.e.:
<code class="hidden_elem"><!--
...
<div class="text_exposed_root">
<p>text</p>
Cheerio will not parse the markup inside comment nodes, so selectors never match it. You'll most likely need to remove the <!-- and --> delimiters and load the result back into cheerio to parse the part of the HTML that you want. Good luck!
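To show why only the delimiters should be removed, here is a minimal sketch in plain JavaScript with no dependencies. The helper names unwrapComments and dropComments are made up for this example, and the sample string is a simplified assumption about the page's shape, not real Facebook output.

```javascript
// Hypothetical helper: remove only the "<!--" and "-->" delimiters,
// so the markup that was wrapped in a comment becomes ordinary HTML.
function unwrapComments(html) {
  return html.replace(/<!--|-->/g, '');
}

// For contrast: this pattern removes each comment together with
// everything inside it, so the wrapped markup is lost.
function dropComments(html) {
  return html.replace(/<!--[\s\S]*?-->/g, '');
}

// Simplified sample (an assumption, not actual page source).
const sample =
  '<code class="hidden_elem"><!-- <div class="text_exposed_root"><p>text</p></div> --></code>';

const unwrapped = unwrapComments(sample);
const dropped = dropComments(sample);

console.log(unwrapped.includes('<p>text</p>')); // true
console.log(dropped.includes('<p>text</p>'));   // false
```

The unwrapped string is what you would then pass to cheerio.load, since the paragraphs are now regular elements that a selector such as '.text_exposed_root p' can match.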
Upvotes: 2