user3260254

Reputation: 37

List all Wikipedia articles in one category and subcategories

Is there any way to get a list of all Wikipedia articles in one category, including all subcategories?

I tried to extract the links from the category page with a PHP script, but it seems there is no way to get all articles, including those in subcategories.

Upvotes: 3

Views: 4475

Answers (2)

Joce

Reputation: 2332

This can be done with PetScan, look e.g. at the example https://petscan.wmflabs.org/?psid=19820

You can choose how deep to go into subcategories, add search terms, and/or exclude pages belonging to specified other categories.
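You can also build the query directly from URL parameters instead of a saved psid; something along these lines should work (parameter names as in PetScan's query form, with Physics and depth=2 as placeholders):

https://petscan.wmflabs.org/?language=en&project=wikipedia&categories=Physics&depth=2&format=json&doit=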

Upvotes: 2

Ilmari Karonen

Reputation: 50328

You can do this using the MediaWiki API, specifically list=categorymembers.

Here's a random example:

https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Defunct_airports_in_Prince_Edward_Island

The link above will give you a list of all pages in Category:Defunct airports in Prince Edward Island in XML format (pretty-printed for easier human readability by default). You can choose from a variety of machine-readable output formats by appending an appropriate parameter, such as format=xml or format=json, to the URL.
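For instance, here's a bare-bones PHP sketch (no API library, no error handling; it assumes file_get_contents can fetch HTTPS URLs on your setup) that pulls that list as JSON and prints the titles:

$url = 'https://en.wikipedia.org/w/api.php?action=query&list=categorymembers'
     . '&cmtitle=' . urlencode( 'Category:Defunct airports in Prince Edward Island' )
     . '&format=json';

$result = json_decode( file_get_contents( $url ), true );

foreach ( $result['query']['categorymembers'] as $member ) {
    // each member has 'pageid', 'ns' (namespace) and 'title' fields
    echo $member['title'], "\n";
}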

Note that, in general, the query shown above will return all pages in the category, both articles and subcategories. You can limit it to articles only by including the parameter cmnamespace=0, but then you'll miss any subcategories. (You can always get those separately with cmnamespace=14, though.)
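Concretely, the two variants of the example query would be:

https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Defunct_airports_in_Prince_Edward_Island&cmnamespace=0   (articles only)

https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Defunct_airports_in_Prince_Edward_Island&cmnamespace=14   (subcategories only)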

The reason you might want that information is that the list=categorymembers query by itself will not recurse into subcategories, so if you want that, you'll have to do it yourself. If you do that, though, be careful not to get caught in any category loops, and make sure you do a sanity check on the results — it's very easy to get way more pages than you expected from a full subcategory traversal.

Also, by default, a single categorymembers query will give you at most 10 results. You can increase that limit to 500 (or 5000, if you happen to have access to a bot-flagged account on Wikipedia) by including the parameter cmlimit=max in your query, but even then, very large categories might get cut off. If that happens, the query result will include a query continuation section that will tell you (or your MW API client library) how to obtain the rest of the pages using additional queries.
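If you're not using a client library, here's a rough sketch of following those continuations by hand (this assumes the API's current continue/cmcontinue response format; very old installations used a query-continue block instead):

$base = 'https://en.wikipedia.org/w/api.php?action=query&list=categorymembers'
      . '&cmtitle=' . urlencode( 'Category:Defunct airports in Prince Edward Island' )
      . '&cmlimit=max&format=json';

$titles   = array();
$continue = '';

do {
    $result = json_decode( file_get_contents( $base . $continue ), true );

    foreach ( $result['query']['categorymembers'] as $member ) {
        $titles[] = $member['title'];
    }

    // when the result is cut off, the response carries a 'continue' block
    // whose parameters get appended to the next request
    $continue = isset( $result['continue'] )
        ? '&' . http_build_query( $result['continue'] )
        : null;
} while ( $continue );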


Edit: I kind of missed the fact that you were specifically asking about getting articles in subcategories. Here's some basic (untested!) example code showing how you could do that using the Apibot 0.40 bridge interface (which I picked more or less at random because it looked like a decent PHP MW API client library, and it handles details like query continuations so I don't have to):

function pages_under_category ( $category ) {
    global $bridge;  // I'll assume you've set this up in advance

    $queue = array( $category );  // categories to fetch
    $seen  = array( $category );  // categories already seen
    $pages = array();  // result pages (format: $title => array( $cat, ... ))

    while ( !empty( $queue ) ) {
        $cat = array_shift( $queue );

        $query = $bridge->query_list_categorymembers();
        $query->title = $cat;  // assume "Category:" prefix is included

        // fetch the contents of the category
        $query_result = $query->xfer();
        while ( $query_result ) {
            foreach ( $query->data as $page_data ) {

                $title = $page_data['title'];
                $namespace = $page_data['ns'];

                if ( $namespace == 0 ) {      // it's an article!
                    if ( !isset( $pages[$title] ) ) {
                        $pages[$title] = array();
                    }
                    $pages[$title][] = $cat;  // record where we found it
                }
                else if ( $namespace == 14 ) {  // it's a subcategory
                    if ( !in_array( $title, $seen ) ) {
                        $seen[] = $title;  // avoid loops!
                        $queue[] = $title;
                    }
                }
            }
            $query_result = $query->next();
        }
    }
    return $pages;
}

One feature you might want to add to the code above is some kind of limit on the result size / number of iterations, so that even if the recursive retrieval somehow finds its way to, say, Category:Contents, it will at some point stop trying to list every page on Wikipedia.
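For example (with a hypothetical $max_pages cap, not part of the code above), an early-exit check at the top of the outer loop would do it:

// inside pages_under_category(), at the top of the while ( !empty( $queue ) ) loop:
if ( count( $pages ) >= $max_pages ) {  // $max_pages: hypothetical cap, e.g. 10000
    break;  // give up rather than try to crawl half of Wikipedia
}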

Upvotes: 6
