EmJ

Reputation: 4608

How to get Wikipedia data for WikiProjects?

I recently found that Wikipedia has WikiProjects that are categorised by discipline (https://en.wikipedia.org/wiki/Category:WikiProjects_by_discipline). As shown in the link, there are 34 disciplines.

I would like to know whether it is possible to get all the Wikipedia articles related to each of these disciplines.

For example, consider WikiProject Computer science. Is it possible to get all the computer-science-related Wikipedia articles using the WikiProject Computer science category? If so, are there any data dumps for it, or is there some other way to obtain this data?

I am currently using Python (i.e. pywikibot and pymediawiki); however, I am happy to receive answers in other languages as well.

I am happy to provide more details if needed.

Upvotes: 2

Views: 947

Answers (3)

Shijith Kunhitty

Reputation: 86

Came across this page in my Google results, so I'm leaving some working code here for posterity. It talks to Wikipedia's API directly and doesn't use pywikibot or pymediawiki.

Getting the article names is a two-step process, because the members of the category are not the articles themselves but their talk pages. So first we get the talk pages, and then we look up their parent pages, the actual articles.

(For more info on the parameters used in the API requests, check the pages for querying category members, and querying page info.)

import time
import requests
from datetime import datetime, timezone
import json

utc_time_now = datetime.now(timezone.utc)
utc_time_now_string =\
utc_time_now.replace(microsecond=0).replace(tzinfo=None).isoformat() + 'Z'

api_url = 'https://en.wikipedia.org/w/api.php'
headers = {'User-Agent': ('<Your purpose>, owner_name: <Your name>, '
                          'email_id: <Your email id>')}
# or you can follow the instructions at
# https://www.mediawiki.org/wiki/API:Etiquette#The_User-Agent_header

category = "Category:WikiProject_Computer_science_articles"

combined_category_members = []

params = {
        'action': 'query',
        'format': 'json',
        'list':'categorymembers',
        'cmtitle': category,
        'cmprop': 'ids|title|timestamp',
        'cmlimit': 500,
        'cmstart': utc_time_now_string,
        # you can also put a 'cmend': '20210101000000'
        # (that YYYYMMDDHHMMSS string stands for 12 am UTC on Jan 1, 2021)
        # this then gathers category members added from now back to the 'cmend' value
        'cmdir': 'older',
        'cmnamespace': '0|1',
        'cmsort': 'timestamp'
}

response = requests.get(api_url, headers=headers, params=params)
data = response.json()
category_members = data['query']['categorymembers']
combined_category_members.extend(category_members)

while 'continue' in data:
    params.update(data['continue'])
    time.sleep(1)
    response = requests.get(api_url, headers=headers, params=params)
    data = response.json()
    category_members = data['query']['categorymembers']
    combined_category_members.extend(category_members)

# so far we have only collected the talk page ids
# now we have to get the parent page ids from the talk page ids

final_dict = {}

talk_page_id_list = []
for member in combined_category_members:
    talk_page_id = member['pageid']
    talk_page_id_list.append(talk_page_id)

while talk_page_id_list: #while not an empty list
    fifty_pageid_batch = talk_page_id_list[0:50]
    fifty_pageid_batch_converted = [str(number) for number in fifty_pageid_batch]
    fifty_pageid_string = '|'.join(fifty_pageid_batch_converted)
    params = {
            'action':   'query',
            'format':   'json',
            'prop':     'info',
            'pageids':  fifty_pageid_string,
            'inprop': 'subjectid|associatedpage'
            }
    time.sleep(1)
    response = requests.get(api_url, headers=headers, params=params)
    data = response.json()
    for talk_page_id, talk_page_id_dict in data['query']['pages'].items():
        if 'subjectid' in talk_page_id_dict:
            # a talk page: record its parent (subject) page
            page_id = str(talk_page_id_dict['subjectid'])
            page_title = talk_page_id_dict['associatedpage']
        else:
            # already a main-namespace page (cmnamespace included 0): record it directly
            page_id = talk_page_id
            page_title = talk_page_id_dict['title']
        final_dict[page_id] = page_title

    del talk_page_id_list[0:50] 

with open('comp_sci_category_members.json', 'w', encoding='utf-8') as filex:
    json.dump(final_dict, filex, ensure_ascii=False)
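
The resulting comp_sci_category_members.json is a flat mapping from article page ids to article titles (roughly {"<page_id>": "<article title>", ...}), which you can load back with json.load for further processing.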

Upvotes: 0

Ali

Reputation: 1689

As I suggested, and adding to @Arash's answer, you can use the Wikipedia API to get the data. Here is the documentation describing how to do that: API:Categorymembers#GET_request

Since you commented that you need to fetch the data programmatically, below is sample code in JavaScript. It fetches the first 500 names from Category:WikiProject_Computer_science_articles and prints them as output. You can port this example to the language of your choice:

// Importing the module
const fetch = require('node-fetch');

// URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=ids%7Ctitle&cmlimit=500";

// Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Getting the length of the returned array
    let len = t.query.categorymembers.length;
    // Iterating over all the response data
    for(let i=0;i<len;i++) {
        // Printing the names
        console.log(t.query.categorymembers[i].title);
    }
});

To write the data to a file, you can do it like below:

//Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');

//URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=ids%7Ctitle&cmlimit=500";

//Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Getting the length of the returned array
    let len = t.query.categorymembers.length;
    // Initializing an empty array
    let titles = [];
    // Iterating over all the response data
    for(let i=0;i<len;i++) {
        // Printing the names
        let title = t.query.categorymembers[i].title;
        console.log(title);
        titles[i] = title;
    }
    fs.writeFileSync('pathtotitles\\titles.txt', titles);
});

The above stores the data comma-separated, because we pass a JavaScript array to writeFileSync. If you want one title per line, without commas, do it like this:

//Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');

//URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=ids%7Ctitle&cmlimit=500";

//Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Getting the length of the returned array
    let len = t.query.categorymembers.length;
    // Initializing an empty string
    let titles = '';
    // Iterating over all the response data
    for(let i=0;i<len;i++) {
        // Printing the names
        let title = t.query.categorymembers[i].title;
        console.log(title);
        titles += title + "\n";
    }
    fs.writeFileSync('pathtotitles\\titles.txt', titles);
});

With cmlimit we can't fetch more than 500 titles per request, so we need to use cmcontinue to check for and fetch the next pages...

Try the code below, which fetches all the titles of a particular category, prints them, and appends the data to a file:

//Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');
//URL with resources to fetch
var url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmlimit=500";

// Method to fetch and append the data to a file 
var fetchTheData = async (url) => {
    return await fetch(url).then(res => res.json()).then(data => {
        // Getting the length of the returned array
        let len = data.query.categorymembers.length;
        // Initializing an empty string
        let titles = '';
        // Iterating over all the response data
        for(let i=0;i<len;i++) {
            // Printing the names
            let title = data.query.categorymembers[i].title;
            console.log(title);
            titles += title + "\n";
        }
        // Appending to the file
        fs.appendFileSync('pathtotitles\\titles.txt', titles);
        // Return the continuation token, or undefined when there are no more pages
        return data.continue ? data.continue.cmcontinue : undefined;
    });
}

// Method which constructs the next page URL and keeps fetching until the category is exhausted
var constructNextPageURL = async (url) => {
    // Getting the first continuation token
    let nextPage = await fetchTheData(url);
    // Keep fetching while the API returns a cmcontinue token
    while (nextPage !== undefined) {
        console.log("=> The next page URL is : " + (url + '&cmcontinue=' + nextPage));
        // Constructing the next page URL with the token and sending the fetch request
        nextPage = await fetchTheData(url + '&cmcontinue=' + nextPage);
    }
    console.log("===>>> Finished Fetching...");
}

// Calling to begin extraction
constructNextPageURL(url);

I hope it helps...

Upvotes: 3

Arash

Reputation: 154

You can use API:Categorymembers to get the list of subcategories and pages. Set the "cmtype" parameter to "subcat" to get subcategories and "cmnamespace" to "0" to get articles. A minimal sketch of both calls follows below.
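
Here is a minimal Python sketch of both calls (the get_category_members helper and variable names are mine, not part of the API; the pagination follows the documented cmcontinue mechanism):

import requests

API_URL = 'https://en.wikipedia.org/w/api.php'

def get_category_members(category, **extra):
    """Yield members of a category, following 'cmcontinue' pagination."""
    params = {'action': 'query', 'format': 'json',
              'list': 'categorymembers', 'cmtitle': category,
              'cmlimit': 500, **extra}
    while True:
        data = requests.get(API_URL, params=params).json()
        yield from data['query']['categorymembers']
        if 'continue' not in data:
            break
        params.update(data['continue'])

category = 'Category:WikiProject_Computer_science_articles'
# cmtype=subcat returns only the subcategories
subcategories = list(get_category_members(category, cmtype='subcat'))
# cmnamespace=0 returns only main-namespace pages, i.e. articles
articles = list(get_category_members(category, cmnamespace='0'))
print(len(subcategories), len(articles))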

You can also get the list from the database (category hierarchy information is in the categorylinks table and article information in the page table).
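
If you go the database route, a sketch of the join, assuming you have imported the enwiki categorylinks and page dump tables into a local MySQL database (the connection details and database name below are placeholders):

import pymysql

# assumes the enwiki 'categorylinks' and 'page' dump tables are loaded locally
conn = pymysql.connect(host='localhost', user='<user>', password='<password>',
                       database='enwiki')
with conn.cursor() as cur:
    # cl_to stores the category title without the 'Category:' prefix;
    # cl_from is the page_id of the member page (here a talk page, namespace 1)
    cur.execute("""
        SELECT p.page_id, p.page_title
        FROM categorylinks cl
        JOIN page p ON p.page_id = cl.cl_from
        WHERE cl.cl_to = %s AND p.page_namespace = 1
    """, ('WikiProject_Computer_science_articles',))
    talk_pages = cur.fetchall()
conn.close()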

Upvotes: 2
