Reputation: 5043
I'm looking for things that can distinguish a blog from a normal website. These need to be things a program can identify from the HTML of a site, or particular features that the site supports, for example pings. I need the same for news websites.
I'm working on a blog/news monitor program. It will index sites, automatically determine whether each one is a blog or a news site, and then monitor user feedback (comments etc.) on posts from the sites it determines to be of a blog or news nature.
So what I'm really after is suggestions on what I can use or look out for in identifying these sites.
It's going to be a desktop app written in Java, so if you have any code specifics in Java, that would be great.
Thanks in advance.
Upvotes: 1
Views: 83
Reputation: 3732
Look for a discoverable RSS or Atom feed, which should be present on a blog or serially-updated news site.
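As a rough illustration of the feed check, here is a minimal sketch that scans already-downloaded HTML for feed autodiscovery `<link>` tags. The class and method names (`FeedDiscovery`, `findFeedUrls`) are made up for this example, and it uses regular expressions from the standard library rather than a real HTML parser, so treat it as a starting point only. It assumes the common convention of `rel="alternate"` with a type of `application/rss+xml` or `application/atom+xml`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds feed URLs advertised through <link> autodiscovery tags.
class FeedDiscovery {
    private static final Pattern LINK_TAG =
            Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL_ALTERNATE =
            Pattern.compile("rel\\s*=\\s*[\"']alternate[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern FEED_TYPE =
            Pattern.compile("type\\s*=\\s*[\"']application/(rss|atom)\\+xml[\"']",
                    Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns the href of every <link> tag that looks like an RSS or Atom feed.
    static List<String> findFeedUrls(String html) {
        List<String> feeds = new ArrayList<>();
        Matcher tags = LINK_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            if (REL_ALTERNATE.matcher(tag).find() && FEED_TYPE.matcher(tag).find()) {
                Matcher href = HREF.matcher(tag);
                if (href.find()) {
                    feeds.add(href.group(1)); // may be relative; resolve against the page URL
                }
            }
        }
        return feeds;
    }
}
```

A non-empty result is only a hint: plenty of ordinary sites publish feeds too, and some blogs do not advertise theirs in the page head.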
Upvotes: 0
Reputation: 33082
You can search the page for the word "blog", as it will probably be present. More specifically, you can look for it only in certain parts of the HTML page (such as the title), or exclude parts such as link text. This will give you a decent starting point.
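A minimal sketch of that keyword check, assuming you already have the page HTML as a string. The names `BlogKeywordCheck` and `mentionsBlog` are invented for this example. It drops scripts, styles and the contents of `<a>` elements, strips the remaining tags, and then looks for "blog" as a whole word. Regex-based stripping is fragile on messy markup, so a proper HTML parser would be a sturdier choice in a real program.

```java
import java.util.regex.Pattern;

// Hypothetical helper: checks whether the visible, non-link text mentions "blog".
class BlogKeywordCheck {
    private static final Pattern SCRIPT_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1>");
    private static final Pattern ANCHORS =
            Pattern.compile("(?is)<a\\b[^>]*>.*?</a>");
    private static final Pattern TAGS =
            Pattern.compile("(?s)<[^>]+>");
    private static final Pattern BLOG_WORD =
            Pattern.compile("(?i)\\bblog\\b");

    static boolean mentionsBlog(String html) {
        // Remove scripts and styles first so their contents are not treated as text.
        String text = SCRIPT_STYLE.matcher(html).replaceAll(" ");
        // Exclude link text: a link labelled "blog" may just point at another site.
        text = ANCHORS.matcher(text).replaceAll(" ");
        // Strip whatever tags are left, keeping the text between them (title, headings, ...).
        text = TAGS.matcher(text).replaceAll(" ");
        return BLOG_WORD.matcher(text).find();
    }
}
```

Because the pattern matches the whole word only, variants such as "blogs" or "weblog" are not counted; widen the pattern if you want them.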
Ultimately, though, this is something that will have to be done manually. You should construct an interface for people to specify whether it's a blog or a news site, or which features it has, when the site is submitted. Then you should create a database of sites and features, and flag entries so that you or another administrator can review them and make changes. Once you do this for a site, you'll never need to do it again. Some entries can also cover many sites at once: for example, everything under http://*.wordpress.com/ is going to be a blog.
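To show what the reviewed database could look like in code, here is a small in-memory sketch. Everything in it (`KnownSiteRegistry`, `SiteType`, `register`, `classify`) is hypothetical, and a real application would persist the entries rather than keep them in a map. An entry such as `wordpress.com` is treated like the `*.wordpress.com` wildcard above, so it matches any subdomain but not the bare domain.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical registry of manually reviewed host suffixes.
class KnownSiteRegistry {
    enum SiteType { BLOG, NEWS, UNKNOWN }

    private final Map<String, SiteType> bySuffix = new LinkedHashMap<>();

    // Records that every subdomain of hostSuffix has been reviewed as the given type.
    void register(String hostSuffix, SiteType type) {
        bySuffix.put(hostSuffix.toLowerCase(Locale.ROOT), type);
    }

    // Looks up a URL; UNKNOWN means the site still needs automatic checks or manual review.
    SiteType classify(String url) {
        String host = URI.create(url).getHost(); // throws IllegalArgumentException on malformed input
        if (host == null) {
            return SiteType.UNKNOWN;
        }
        host = host.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, SiteType> entry : bySuffix.entrySet()) {
            if (host.endsWith("." + entry.getKey())) {
                return entry.getValue();
            }
        }
        return SiteType.UNKNOWN;
    }
}
```

Usage would be along the lines of `registry.register("wordpress.com", KnownSiteRegistry.SiteType.BLOG)` once an administrator has confirmed the rule, followed by `registry.classify(url)` for each newly submitted site.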
Some features you can detect automatically, or at least have a pretty good chance of detecting, but ultimately you will need a manual review.
Upvotes: 1