Reputation: 6394
I know I can download Wikipedia in its entirety, but I wonder whether there is a way to download it by category. There is the Special:Export page, but entering a category (e.g. Culture) adds the child pages plus another batch of subcategories, so collecting every page in Culture would take forever: each subcategory you submit reveals yet more subcategories. Does anyone know of another, simpler way to export by category?
Upvotes: 5
Views: 3208
Reputation: 50328
Using the MediaWiki API, you can get the wikitext of all pages in a category by using list=categorymembers as a generator for a prop=revisions query. Such a query returns the content of the first 10 articles in Category:Culture on Wikipedia by default. You can add the gcmlimit=max parameter to get more pages per request, but for large categories you'll need to handle query continuations properly (or use a MediaWiki API client that handles them for you).
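As a minimal sketch of that generator query with continuation handling (using only the standard library; the endpoint assumes the English Wikipedia):

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def build_params(category, limit="max"):
    # categorymembers as a generator feeding a prop=revisions query.
    return {
        "action": "query",
        "format": "json",
        "generator": "categorymembers",
        "gcmtitle": category,
        "gcmtype": "page",    # articles only; drop this to also see subcategories/files
        "gcmlimit": limit,
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
    }

def category_pages(category):
    """Yield (title, wikitext) for pages directly in `category`,
    following the API's 'continue' tokens until it is exhausted."""
    params = build_params(category)
    while True:
        url = API + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        for page in data.get("query", {}).get("pages", {}).values():
            revs = page.get("revisions")
            if revs:  # not every page carries content in every continuation batch
                yield page["title"], revs[0]["slots"]["main"]["*"]
        if "continue" not in data:
            return
        params.update(data["continue"])  # fold continuation tokens into the next request
```

Something like `for title, text in category_pages("Category:Culture"): ...` would then walk the whole category, one API batch at a time.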
(However, this query won't show pages in subcategories of Category:Culture. If you want those too, you can get a list of the pages and subcategories in a category using a simple categorymembers query without cmnamespace, and recurse through the results to collect a list of article titles to export. If you do that, be careful not to get caught in any category loops, and preferably do a sanity check on the results before exporting the pages — it's very easy to get way more pages than you expected from a full subcategory traversal.)
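That recursive traversal with loop protection can be sketched generically; `list_members` here is a hypothetical callback that would wrap one categorymembers query and return the article and subcategory titles it found:

```python
def collect_articles(category, list_members, seen=None):
    """Depth-first walk of a category tree.

    `list_members(cat)` must return a pair (article_titles, subcategory_titles)
    for one category; the `seen` set guards against category loops, which do
    occur in Wikipedia's category graph.
    """
    seen = set() if seen is None else seen
    if category in seen:       # already visited: break the cycle
        return set()
    seen.add(category)
    articles, subcats = list_members(category)
    result = set(articles)
    for sub in subcats:
        result |= collect_articles(sub, list_members, seen)
    return result
```

A sanity check as suggested above could be as simple as printing `len(result)` before starting any export.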
Upvotes: 5
Reputation: 244787
I don't think there is any other simple way to do that.
I think your best bet is to download the dump file of all articles (pages-articles, currently 7.5 GB for the English Wikipedia) and filter them by category, possibly using the category membership dump (categorylinks, 1 GB).
Another option is to do something similar to what you would do with Special:Export manually, but automate it using the API.
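For instance, the API's export parameter returns the same XML that Special:Export produces, so a sketch (assuming the English Wikipedia endpoint, and titles gathered beforehand) might build the request like this:

```python
import urllib.parse

API = "https://en.wikipedia.org/w/api.php"

def export_url(titles):
    """Build an API URL whose response is Special:Export-style XML
    for `titles` (Wikipedia caps this at roughly 50 titles per request,
    so longer lists must be chunked)."""
    params = {
        "action": "query",
        "titles": "|".join(titles),
        "export": "1",
        "exportnowrap": "1",  # return bare XML instead of wrapping it in the API envelope
    }
    return API + "?" + urllib.parse.urlencode(params)
```

Fetching `export_url(["Art", "Jazz"])` would yield an XML document importable by another MediaWiki instance, just like a manual Special:Export.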
Upvotes: 3