yayu

Reputation: 8098

Hacker News: how to extract comment hierarchy

I am trying to parse a comment thread on news.ycombinator.com. Looking at the HTML, however, the comments do not appear to be nested in any way that reflects the reply hierarchy, which makes the thread hard to parse. For example, here is a parent comment and its child:

<!-- This part below draws the upvote/downvote images -->
<table border=0><tr><td><table border=0><tr><td><img src="http://ycombinator.com/images/s.gif" height=1 width=0></td><td valign=top><center><a id=up_4241971 href="vote?for=4241971&dir=up&whence=%69%74%65%6d%3f%69%64%3d%34%32%34%31%37%38%34"><img src="http://ycombinator.com/images/grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_4241971></span></center></td><td class="default"><div style="margin-top:2px; margin-bottom:-10px; ">


<!-- This part below is user/time and permalink info for a parent comment -->
<span class="comhead"><a href="user?id=JshWright">JshWright</a> 7 hours ago  | <a href="item?id=4241971">link</a></span></div><br>


<!-- This part below is actual Comment -->
<span class="comment"><font color=#000000>I just got my Verizon Galaxy S3, and ordered the 20-pack of NFC tags offered by <a href="http://tagsfordroid.com" rel="nofollow">http://tagsfordroid.com</a><p>I think I know what my Dad felt like when he got his first label printer... Within days it seemed like every object in his office was labeled...<p>I've got a tag in my car to automatically send my wife a "Headed home" SMS, a tag on my night stand to toggle between 'night' (silent) and 'day' (loud) volume settings, a tag by my back door to launch CardioTrainer when I go out for a run (this one may have crossed the "I've run out of ideas" line...). I'm using the keychain tag to dial a response number for the fire department I'm a member of.</font></span><p><font size=1><u><a href="reply?id=4241971&whence=%69%74%65%6d%3f%69%64%3d%34%32%34%31%37%38%34">reply</a></u></font></td></tr></table></td></tr>


<!-- This part below is upvote/downvote arrow for child of parent -->
<tr><td><table border=0><tr><td><img src="http://ycombinator.com/images/s.gif" height=1 width=40></td><td valign=top><center><a id=up_4242025 href="vote?for=4242025&dir=up&whence=%69%74%65%6d%3f%69%64%3d%34%32%34%31%37%38%34"><img src="http://ycombinator.com/images/grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_4242025></span></center></td><td class="default"><div style="margin-top:2px; margin-bottom:-10px; ">

<!-- This part has user/time/permalink for child comment -->
<span class="comhead"><a href="user?id=msbmsb">msbmsb</a> 7 hours ago  | <a href="item?id=4242025">link</a></span></div><br>

<!-- This part is the content of the  child comment -->
<span class="comment"><font color=#000000>I did the same thing. Tag next to the entry-way light switch for changing to an "at-home" profile, tag next to the bed for switching between night mode and morning mode, tag at work, keychain tag for switching between car mode and quiet mode.<p>And profile switching is just the basics. You can have a tag that connects guests' NFC-enabled phones to your wifi without having to hand out the password, for instance.<p>NFC task launcher + tasker is an amazing combination that opens up all kinds of possibilities.</font></span><p><font size=1><u><a href="reply?id=4242025&whence=%69%74%65%6d%3f%69%64%3d%34%32%34%31%37%38%34">reply</a></u></font></td></tr></table></td></tr><tr><td>

So how does Hacker News represent the hierarchical structure of the comments, and how can I reconstruct it when scraping the page?

Upvotes: 0

Views: 379

Answers (1)

Managu

Reputation: 9039

The nesting is encoded as indentation: each comment's table row starts with a spacer image whose width sets the indent.

...<td><img src="http://ycombinator.com/images/s.gif" height=1 width=0></td>...
...<td><img src="http://ycombinator.com/images/s.gif" height=1 width=40></td>...

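Presumably you'd read the width attribute of each spacer image: in your sample, width=0 is the parent and width=40 is its child, so each extra 40 pixels appears to mean one level deeper. Since the comments appear in document order, you can reconstruct the threading by keeping a stack of the widths seen so far: a comment's parent is the nearest preceding comment with a smaller width.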
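A minimal sketch of that stack approach in Python, assuming the markup looks like the sample in the question. The function names `extract_comments` and `build_tree` are made up for illustration, and the regex assumes every comment has an `up_<id>` vote link right after its spacer image, which may not hold for every comment on a real page:

```python
import re

# Assumption: each comment starts with the s.gif spacer (its width is the
# indent) followed by a vote link whose id is "up_<comment id>".
SPACER_RE = re.compile(
    r's\.gif"\s+height=1\s+width=(\d+)>.*?<a id=up_(\d+)', re.S
)

def extract_comments(html):
    """Return (comment_id, width) pairs in document order."""
    return [(int(cid), int(width)) for width, cid in SPACER_RE.findall(html)]

def build_tree(comments):
    """Map each comment id to its parent id (None for top-level comments)."""
    parents = {}
    stack = []  # (width, comment_id) for the ancestors of the current comment
    for cid, width in comments:
        # Anything at the same or a deeper indent cannot be this comment's parent.
        while stack and stack[-1][0] >= width:
            stack.pop()
        parents[cid] = stack[-1][1] if stack else None
        stack.append((width, cid))
    return parents
```

Calling `build_tree(extract_comments(page_html))` on the fragment in the question should map 4242025 to its parent 4241971. A real HTML parser would be more robust than a regex if the markup varies.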

Upvotes: 2
