Reputation: 4608
I recently found that Wikipedia has WikiProjects that are categorised by discipline (https://en.wikipedia.org/wiki/Category:WikiProjects_by_discipline). As shown at that link, there are 34 disciplines. I would like to know if it is possible to get all the Wikipedia articles related to each of these disciplines.
For example, consider WikiProject Computer science. Is it possible to get all the computer-science-related Wikipedia articles using the WikiProject Computer science category? If so, are there any data dumps for it, or is there some other way to obtain this data?
I am currently using Python (i.e. pywikibot and pymediawiki). However, I am happy to receive answers in other languages as well.
I am happy to provide more details if needed.
Upvotes: 2
Views: 947
Reputation: 86
Came across this page in my Google results; am leaving some working code here for posterity. It interacts with Wikipedia's API directly and doesn't use pywikibot or pymediawiki.
Getting the article names is a two-step process, because the members of the category are not the articles themselves but their talk pages. So first we get the talk pages, and then we have to get the parent pages, the actual articles.
(For more info on the parameters used in the API requests, check the pages for querying category members, and querying page info.)
import json
import time
from datetime import datetime, timezone

import requests

utc_time_now = datetime.now(timezone.utc)
utc_time_now_string = \
    utc_time_now.replace(microsecond=0).replace(tzinfo=None).isoformat() + 'Z'

api_url = 'https://en.wikipedia.org/w/api.php'
headers = {'User-Agent': '<Your purpose>, owner_name: <Your name>, '
                         'email_id: <Your email id>'}
# or you can follow the instructions at
# https://www.mediawiki.org/wiki/API:Etiquette#The_User-Agent_header

category = "Category:WikiProject_Computer_science_articles"
combined_category_members = []

params = {
    'action': 'query',
    'format': 'json',
    'list': 'categorymembers',
    'cmtitle': category,
    'cmprop': 'ids|title|timestamp',
    'cmlimit': 500,
    'cmstart': utc_time_now_string,
    # you can also put a 'cmend': '20210101000000'
    # (that YYYYMMDDHHMMSS string stands for 12 am UTC on Jan 1, 2021)
    # this then gathers category members added from now back to the 'cmend' value
    'cmdir': 'older',
    'cmnamespace': '0|1',
    'cmsort': 'timestamp'
}

response = requests.get(api_url, headers=headers, params=params)
data = response.json()
category_members = data['query']['categorymembers']
combined_category_members.extend(category_members)

while 'continue' in data:
    params.update(data['continue'])
    time.sleep(1)
    response = requests.get(api_url, headers=headers, params=params)
    data = response.json()
    category_members = data['query']['categorymembers']
    combined_category_members.extend(category_members)

# now we've gotten only the talk page ids so far
# next we have to get the parent page ids from the talk page ids
final_dict = {}
talk_page_id_list = []

for member in combined_category_members:
    talk_page_id = member['pageid']
    talk_page_id_list.append(talk_page_id)

while talk_page_id_list:  # while not an empty list
    fifty_pageid_batch = talk_page_id_list[0:50]
    fifty_pageid_batch_converted = [str(number) for number in fifty_pageid_batch]
    fifty_pageid_string = '|'.join(fifty_pageid_batch_converted)
    params = {
        'action': 'query',
        'format': 'json',
        'prop': 'info',
        'pageids': fifty_pageid_string,
        'inprop': 'subjectid|associatedpage'
    }
    time.sleep(1)
    response = requests.get(api_url, headers=headers, params=params)
    data = response.json()
    for talk_page_id, talk_page_id_dict in data['query']['pages'].items():
        if 'subjectid' in talk_page_id_dict:
            # a talk page: map it to its parent (subject) page
            page_id = str(talk_page_id_dict['subjectid'])
            page_title = talk_page_id_dict['associatedpage']
        else:
            # already an article (namespace 0), so keep it as-is
            page_id = talk_page_id
            page_title = talk_page_id_dict['title']
        final_dict[page_id] = page_title
    del talk_page_id_list[0:50]

with open('comp_sci_category_members.json', 'w', encoding='utf-8') as filex:
    json.dump(final_dict, filex, ensure_ascii=False)
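Once the script finishes, a quick sanity check of the output might look like this (just reading back the JSON file written above):

import json

# read back the id -> title mapping written by the script above
with open('comp_sci_category_members.json', encoding='utf-8') as f:
    articles = json.load(f)

print(len(articles), 'articles collected')
for page_id, title in list(articles.items())[:5]:
    print(page_id, title)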
Upvotes: 0
Reputation: 1689
As I suggested, and adding to @arash's answer, you can use the Wikipedia API to get the Wikipedia data. Here is the link describing how to do that: API:Categorymembers#GET_request
Since you commented that you need to fetch the data programmatically, below is sample code in JavaScript. It fetches the first 500 names from Category:WikiProject_Computer_science_articles and displays them as output. You can adapt this example to the language of your choice:
// Importing the module
const fetch = require('node-fetch');

// URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=ids%7Ctitle&cmlimit=500";

// Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Getting the length of the returned array
    let len = t.query.categorymembers.length;
    // Iterating over all the response data
    for (let i = 0; i < len; i++) {
        // Printing the names
        console.log(t.query.categorymembers[i].title);
    }
});
To write the data into a file, you can do it like below:
// Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');

// URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=ids%7Ctitle&cmlimit=500";

// Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Getting the length of the returned array
    let len = t.query.categorymembers.length;
    // Initializing an empty array
    let titles = [];
    // Iterating over all the response data
    for (let i = 0; i < len; i++) {
        // Collecting the names
        let title = t.query.categorymembers[i].title;
        console.log(title);
        titles[i] = title;
    }
    // Joining the array with commas before writing it out
    fs.writeFileSync('pathtotitles\\titles.txt', titles.join(','));
});
The above stores the data comma-separated, because we join the JavaScript array with commas there. If you want each title on its own line, without commas, you need to do it like this:
// Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');

// URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=ids%7Ctitle&cmlimit=500";

// Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Getting the length of the returned array
    let len = t.query.categorymembers.length;
    // Initializing an empty string
    let titles = '';
    // Iterating over all the response data
    for (let i = 0; i < len; i++) {
        // Printing the names
        let title = t.query.categorymembers[i].title;
        console.log(title);
        titles += title + "\n";
    }
    fs.writeFileSync('pathtotitles\\titles.txt', titles);
});
Because cmlimit caps a single request at 500 titles, we need to use cmcontinue to check for and fetch the next pages...
Try the code below, which fetches all the titles of a particular category, prints them, and appends the data to a file:
// Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');

// URL with resources to fetch
var url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmlimit=500";

// Method to fetch one page of results and append the data to a file
var fetchTheData = async (url) => {
    return await fetch(url).then(res => res.json()).then(data => {
        // Getting the length of the returned array
        let len = data.query.categorymembers.length;
        // Initializing an empty string
        let titles = '';
        // Iterating over all the response data
        for (let i = 0; i < len; i++) {
            // Printing the names
            let title = data.query.categorymembers[i].title;
            console.log(title);
            titles += title + "\n";
        }
        // Appending to the file
        fs.appendFileSync('pathtotitles\\titles.txt', titles);
        // Returning the continuation token, or null when there are no more pages
        return data.continue ? data.continue.cmcontinue : null;
    });
}

// Method which constructs the next page URL from the continuation token
// and keeps fetching until the API stops returning one
var constructNextPageURL = async (url) => {
    // Getting the first page and its continuation token
    let nextPage = await fetchTheData(url);
    while (nextPage) {
        console.log("=> The next page URL is : " + (url + '&cmcontinue=' + nextPage));
        // Constructing the next page URL with the token and sending the fetch request
        nextPage = await fetchTheData(url + '&cmcontinue=' + nextPage);
    }
    console.log("===>>> Finished Fetching...");
}

// Calling to begin extraction
constructNextPageURL(url);
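Since you mentioned you are using Python, the same cmcontinue loop looks roughly like this with requests (a sketch of the same idea, not a drop-in replacement):

import requests

api_url = 'https://en.wikipedia.org/w/api.php'
params = {
    'action': 'query',
    'format': 'json',
    'list': 'categorymembers',
    'cmtitle': 'Category:WikiProject_Computer_science_articles',
    'cmlimit': 500,
}

titles = []
while True:
    data = requests.get(api_url, params=params).json()
    titles.extend(member['title'] for member in data['query']['categorymembers'])
    if 'continue' not in data:
        break  # no more pages to fetch
    params.update(data['continue'])  # carries cmcontinue into the next request

print(len(titles), 'titles fetched')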
I hope it helps...
Upvotes: 3
Reputation: 154
You can use API:Categorymembers to get the list of subcategories and pages. Set the "cmtype" parameter to "subcat" to get subcategories and "cmnamespace" to "0" to get articles.
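For example, listing only the subcategories of the discipline category with Python's requests might look like this (a sketch using the parameters named above):

import requests

params = {
    'action': 'query',
    'format': 'json',
    'list': 'categorymembers',
    'cmtitle': 'Category:WikiProjects_by_discipline',
    'cmtype': 'subcat',  # only subcategories; use cmnamespace=0 instead for articles
    'cmlimit': 500,
}
data = requests.get('https://en.wikipedia.org/w/api.php', params=params).json()
for member in data['query']['categorymembers']:
    print(member['title'])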
You can also get the list from the database (the category hierarchy information is in the categorylinks table and article information is in the page table).
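If you go the database route, a sketch of the query could look like this (assuming the enwiki categorylinks and page tables are loaded into a local MySQL instance; the connection details are placeholders):

import pymysql

# placeholder connection details for a local copy of the enwiki tables
conn = pymysql.connect(host='localhost', user='user', password='pass', db='enwiki')
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT p.page_id, p.page_title
        FROM page p
        JOIN categorylinks cl ON cl.cl_from = p.page_id
        WHERE cl.cl_to = %s
          AND p.page_namespace = 1  -- WikiProject category members are talk pages
        """,
        ('WikiProject_Computer_science_articles',),
    )
    for page_id, title in cur.fetchall():
        # page_title is stored as bytes in the binary schema
        print(page_id, title.decode('utf-8') if isinstance(title, bytes) else title)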
Upvotes: 2