Reputation: 21497
How can I scrape Hacker News (https://news.ycombinator.com/) with x-ray/Node.js?
I would like to get something like this out of it:
[
{title1, comment1},
{title2, comment2},
...
{"‘Minimal’ cell raises stakes in race to harness synthetic life", 48}
...
{title 30, comment 30}
]
There is a news table, but I don't know how to scrape it. Each story on the site consists of three rows, and these rows do not share a parent that is unique to the story. The structure looks like this:
<tbody>
<tr class="spacer"> //Markup 1
<tr class="athing"> //Headline 1 ('.deadmark+ a' contains title)
<tr class> //Meta Information 1 (.age+ a contains comments)
<tr class="spacer"> //Markup 2
<tr class="athing"> //Headline 2 ('.deadmark+ a' contains title)
<tr class> //Meta Information 2 (.age+ a contains comments)
...
<tr class="spacer"> //Markup 30
<tr class="athing"> //Headline 30 ('.deadmark+ a' contains title)
<tr class> //Meta Information 30 (.age+ a contains comments)
So far I have tried:
x("https://news.ycombinator.com/", "tr", [{
    title: [".deadmark+ a"],
    comments: ".age+ a"
}])
and
x("https://news.ycombinator.com/", {
    title: [".deadmark+ a"],
    comments: [".age+ a"]
})
The second approach returns 30 titles and 29 comment counts. I do not see any way to map them to each other, since nothing indicates which of the 30 titles is missing a comment count.
Any help is appreciated.
Upvotes: 1
Views: 1426
Reputation: 473903
The markup is not easy to scrape with the X-ray package, since there is no way to reference the current context in a CSS selector. That would be useful for getting the next tr sibling after each tr.athing row, which is the row that holds the comments.
We can still use the "next sibling" combinator (the +) to reach the next row, but instead of targeting the optional comments link, we grab the complete row text and then extract the comment count with a regular expression. If no comments are present, the value is set to 0.
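To make the regular-expression step concrete, here is a minimal sketch of it in isolation, with no network access. The helper name extractCommentCount and the sample row texts are made up for illustration and are not real Hacker News data. The sketch also assumes the count may be separated from the word "comments" by any whitespace, including a non-breaking space, and it returns a number rather than a string:

```javascript
// Hypothetical helper: pull the comment count out of a meta row's full text.
function extractCommentCount(rowText) {
  // \s also matches a non-breaking space (U+00A0), in case the page uses one
  var match = rowText.match(/(\d+)\s+comments?/);
  // Rows without a comments link (e.g. "discuss" or job posts) fall back to 0
  return match ? parseInt(match[1], 10) : 0;
}

// Made-up row texts for illustration only
var sampleRows = [
  "120 points by alice 3 hours ago | 85 comments",
  "1 point by bob 10 minutes ago | discuss",
  "7 points by carol 1 hour ago | 1 comment",
  "dave-corp is hiring 2 hours ago"
];

var counts = sampleRows.map(extractCommentCount);
console.log(counts); // [ 85, 0, 1, 0 ]
```

Because every tr.athing row is followed by exactly one meta row, the list of row texts has the same length as the list of titles, so the two can be paired by index even when a story has no comments.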
Complete working code:
var Xray = require('x-ray');
var x = Xray();

x("https://news.ycombinator.com/", {
    title: ["tr.athing .deadmark+ a"],
    comments: ["tr.athing + tr"]
})(function (err, obj) {
    // stop early if the request or the scraping failed
    if (err) {
        console.error(err);
        return;
    }

    // extracting comments and mapping into an array of objects
    var result = obj.comments.map(function (elm, index) {
        var match = elm.match(/(\d+) comments?/);
        return {
            title: obj.title[index],
            comments: match ? match[1] : "0"
        };
    });
    console.log(result);
});
Currently prints:
[ { title: 'Follow the money: what Apple vs. the FBI is really about',
comments: '85' },
{ title: 'Unable to open links in Safari, Mail or Messages on iOS 9.3',
comments: '12' },
{ title: 'Gogs – Go Git Service', comments: '13' },
{ title: 'Ubuntu Tablet now available for pre-order',
comments: '56' },
...
{ title: 'American Tech Giants Face Fight in Europe Over Encrypted Data',
comments: '7' },
{ title: 'Moving Beyond the OOP Obsession', comments: '34' } ]
Upvotes: 4