Reputation: 17072
I'm apparently a little newer to JavaScript than I'd care to admit. I'm trying to pull a webpage using Node.js and save the contents in a variable, so I can parse it however I feel like.
In Python, I would do this:
from bs4 import BeautifulSoup # for parsing
import urllib
text = urllib.urlopen("http://www.myawesomepage.com/").read()
parse_my_awesome_html(text)
How would I do this in Node? I've gotten as far as:
var request = require("request");
request("http://www.myawesomepage.com/", function (error, response, body) {
/*
Something here that lets me access the text
outside of the closure
This doesn't work:
this.text = body;
*/
})
Upvotes: 7
Views: 25406
Reputation: 1477
var request = require("request");
var parseMyAwesomeHtml = function(html) {
    //Have at it
};
request("http://www.myawesomepage.com/", function (error, response, body) {
    if (!error) {
        parseMyAwesomeHtml(body);
    } else {
        console.log(error);
    }
});
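To see why the parsing call has to sit inside the callback, here is a minimal sketch of the timing. It assumes a hypothetical stand-in, fakeRequest, that only mimics the (error, response, body) callback shape of request and returns a stub page instead of touching the network; the events array is likewise just for illustration.

```javascript
// Hypothetical stand-in for request(): it calls back on a later tick
// with a stub body instead of fetching anything.
function fakeRequest(url, callback) {
  setTimeout(function () {
    callback(null, { statusCode: 200 }, "<html>stub page</html>");
  }, 0);
}

var text; // the variable the question tries to fill from inside the callback
var events = [];

fakeRequest("http://www.myawesomepage.com/", function (error, response, body) {
  text = body;
  events.push("callback ran");
});

// This line runs before the callback does, so text has not been assigned yet.
events.push("after the call, text is " + typeof text);
console.log(events[0]);
```

The assignment itself works; it simply happens after the surrounding code has already finished, which is why anything that needs the body should be called from the callback.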
Edit: As Kishore noted, there are nice options for parsing available. Also see cheerio if you have Python/gyp issues with jsdom on Windows. Cheerio is available on GitHub.
Upvotes: 11
Reputation: 35253
That request() call is asynchronous, so the response is only available inside the callback. You have to call your parse function from it:
function parse_my_awesome_html(text) {
    // parse the text here
}
request("http://www.myawesomepage.com/", function (error, response, body) {
    parse_my_awesome_html(body);
});
Get used to chaining callbacks; that's essentially how all I/O happens in JavaScript :)
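If the nesting gets uncomfortable, one option is to wrap the callback in a Promise so the code reads closer to the Python version. The following is only a sketch: fetchBody and fakeRequest are hypothetical names, and fakeRequest is a stand-in that mimics the (error, response, body) callback shape so the example runs without the network. In real code you would pass request itself as the first argument.

```javascript
// Hypothetical stand-in for request(): same callback shape, stub response.
function fakeRequest(url, callback) {
  setTimeout(function () {
    callback(null, { statusCode: 200 }, "<html><body>Hello from " + url + "</body></html>");
  }, 0);
}

// Wrap a function with that callback shape so it returns a Promise for the body.
function fetchBody(requestFn, url) {
  return new Promise(function (resolve, reject) {
    requestFn(url, function (error, response, body) {
      if (error) {
        reject(error);
      } else if (response.statusCode !== 200) {
        reject(new Error("Unexpected status " + response.statusCode));
      } else {
        resolve(body);
      }
    });
  });
}

async function main() {
  // Reads top to bottom, but is still asynchronous under the hood.
  var text = await fetchBody(fakeRequest, "http://www.myawesomepage.com/");
  console.log(text.length);
  return text;
}

main().catch(console.error);
```

The body is still only available after the await, so code that needs it has to live inside an async function or a then() handler.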
Upvotes: 3
Reputation: 1912
jsdom works well for this kind of thing if you want to parse the response.
var request = require('request'),
    jsdom = require('jsdom');
request({ uri: 'http://www.myawesomepage.com/' }, function (error, response, body) {
    // Stop on a transport error or a non-200 status; response is undefined when error is set
    if (error || response.statusCode !== 200) {
        console.log('Error when contacting myawesomepage.com');
        return;
    }
    jsdom.env({
        html: body,
        scripts: [
            'http://code.jquery.com/jquery-1.5.min.js'
        ]
    }, function (err, window) {
        var $ = window.jQuery;
        // jQuery is now loaded on the jsdom window created from body
        console.log($('body').html());
    });
});
Also, if your page loads a lot of JavaScript/AJAX content, you might want to consider using PhantomJS.
Source: http://blog.nodejitsu.com/jsdom-jquery-in-5-lines-on-nodejs/
Upvotes: 2