Reputation: 770
Since StackOverflow comes with a wealth of questions and user-contributed tags, I am looking at it as an interesting, richly annotated text corpus for NLP (natural language processing) tasks.
Basically, I want to automatically predict a question's tags from its body text. I am sure this can be done to a certain extent, and there are a number of nice use cases, such as tag suggestions (e.g. to make tag usage more consistent), to name just one.
For this I would need a lot of questions, or better yet all of them, along with their body text and user tags, to train a tag predictor with machine learning algorithms.
I know there's the StackOverflow API, but the amount of data I can fetch through it seems to be very limited - for good reasons of course.
So the question is: Is there a way to fetch/download all questions along with their user-tags from StackOverflow?
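To illustrate the goal, a naive baseline could look roughly like the sketch below. It is not a real machine learning model, just word/tag co-occurrence counting, and the function names (`tokenize`, `train`, `predict`) and the toy examples are made up for illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word-like tokens."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


def train(examples):
    """Count, for every word, how often each tag appears on questions containing it."""
    counts = defaultdict(Counter)
    for body, tags in examples:
        for word in set(tokenize(body)):
            counts[word].update(tags)
    return counts


def predict(counts, body, k=3):
    """Suggest up to k tags by summing the tag counts of the words in the body."""
    scores = Counter()
    for word in set(tokenize(body)):
        scores.update(counts.get(word, {}))
    return [tag for tag, _ in scores.most_common(k)]


# Toy training data (hypothetical, not taken from the real site).
examples = [
    ("How do I read a CSV file with pandas?", ["python", "pandas"]),
    ("Segmentation fault when freeing a pointer", ["c", "pointers"]),
    ("pandas dataframe merge is slow", ["python", "pandas"]),
]
model = train(examples)
suggested = predict(model, "Merging two pandas dataframes", k=2)
```

With the full set of questions and tags, the same `(body, tags)` pairs could feed a proper multi-label classifier instead of this counting scheme.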
Upvotes: 1
Views: 233
Reputation: 881623
You can get the data dump at http://www.clearbits.net/torrents/2076-aug-2012. That release is missing the meta sites, a minor oversight that has since been fixed with an alternate release, but that does not affect your request.
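Once the dump is unpacked, extracting questions and their tags could look roughly like the sketch below. It assumes the archive contains a `Posts.xml` file with one `<row>` element per post, that questions have `PostTypeId="1"`, and that tags are stored in a `Tags` attribute in the form `<tag1><tag2>`. Check these assumptions against the files you actually download. The function name `iter_questions` and the inline sample are illustrative only.

```python
import io
import re
import xml.etree.ElementTree as ET

# Assumption: tags are stored as "<tag1><tag2>" inside the Tags attribute.
TAG_PATTERN = re.compile(r"<([^<>]+)>")


def iter_questions(source):
    """Yield (body, tags) for every question row in a Posts.xml-style file.

    `source` may be a file path or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Assumption: PostTypeId "1" marks a question; other rows are skipped.
        if elem.tag == "row" and elem.get("PostTypeId") == "1":
            tags = TAG_PATTERN.findall(elem.get("Tags", ""))
            yield elem.get("Body", ""), tags
        # Drop the attributes of the processed element to reduce memory use.
        elem.clear()


# Tiny inline sample standing in for the real file (hypothetical content).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Body="&lt;p&gt;How do I parse XML?&lt;/p&gt;" Tags="&lt;python&gt;&lt;xml&gt;" />
  <row Id="2" PostTypeId="2" ParentId="1" Body="&lt;p&gt;Use ElementTree.&lt;/p&gt;" />
</posts>"""

questions = list(iter_questions(io.BytesIO(SAMPLE)))
```

For the real dump you would pass the path to the posts file instead of the in-memory sample and consume the generator lazily, since the file is far too large to hold in a list.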
Upvotes: 1