Reputation: 73
I am trying to find the wiki IDs for a list of Wikipedia pages. So the format is:
input: a list of Wikipedia page titles
output: a list of Wikipedia page ids.
So far, I've gone through the MediaWiki API to understand how to proceed, but I couldn't find the right way to implement this. Can anyone suggest how to get the list of page ids?
Upvotes: 3
Views: 2902
Reputation: 1573
Querying the Wikipedia API to get this mapping can be time consuming, given the restrictions on its usage.
It would be better to download the Wikipedia dump and run wikiextractor to convert it into JSON format. In the resulting JSON, the key id
refers to the Wikipedia page id and title
refers to the Wikipedia page title. So, in one go, we get the mapping for all pages in Wikipedia!
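For example, once the dump has been processed with wikiextractor (using its JSON output mode), a small script like the one below can build the complete title-to-id mapping. This is only a sketch: the directory layout and file names (extracted/AA/wiki_00 and so on) reflect wikiextractor's usual output and may need adjusting for your setup.
import glob
import json

def build_title_to_id(extracted_dir):
    # Each wikiextractor JSON output file contains one JSON object per line,
    # with (at least) the keys 'id', 'title' and 'text'.
    mapping = {}
    for path in glob.glob(extracted_dir + '/*/wiki_*'):
        with open(path, encoding='utf-8') as f:
            for line in f:
                page = json.loads(line)
                mapping[page['title']] = page['id']
    return mapping

# Hypothetical usage, assuming the dump was extracted into a directory named 'extracted':
# title_to_id = build_title_to_id('extracted')
# page_ids = [title_to_id[title] for title in ['A', 'B', 'C', 'D']]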
Upvotes: 1
Reputation: 1300
The answer provided by AXO works as long as you don't have titles that need normalization, such as a category page like "Category:Computer_storage_devices", or titles with special characters like &.
In that case you also need to map the response back from the normalized titles, as follows:
def get_page_ids(page_titles):
    import requests

    # Percent-encode the titles so that special characters such as '&' survive the query string.
    page_titles_encoded = [requests.utils.quote(x) for x in page_titles]
    url = (
        'https://en.wikipedia.org/w/api.php'
        '?action=query'
        '&prop=info'
        '&inprop=subjectid'
        '&titles=' + '|'.join(page_titles_encoded) +
        '&format=json')
    json_response = requests.get(url).json()

    # Map each normalized title reported by the API back to the title that was requested.
    page_normalized_titles = {x: x for x in page_titles}
    result = {}
    if 'normalized' in json_response['query']:
        for mapping in json_response['query']['normalized']:
            page_normalized_titles[mapping['to']] = mapping['from']

    for page_id, page_info in json_response['query']['pages'].items():
        normalized_title = page_info['title']
        page_title = page_normalized_titles[normalized_title]
        result[page_title] = page_id
    return result

get_page_ids(page_titles=['Category:R&J_Records_artists', 'Category:Computer_storage_devices', 'Category:Main_topic_classifications'])
will return:
{'Category:R&J_Records_artists': '33352333', 'Category:Computer_storage_devices': '895945', 'Category:Main_topic_classifications': '7345184'}
Upvotes: 0
Reputation: 9086
import requests

page_titles = ['A', 'B', 'C', 'D']

url = (
    'https://en.wikipedia.org/w/api.php'
    '?action=query'
    '&prop=info'
    '&inprop=subjectid'
    '&titles=' + '|'.join(page_titles) +
    '&format=json')
json_response = requests.get(url).json()

# Map each returned page title to its page id.
title_to_page_id = {
    page_info['title']: page_id
    for page_id, page_info in json_response['query']['pages'].items()}

print(title_to_page_id)
print([title_to_page_id[title] for title in page_titles])
This will print:
{'A': '290', 'B': '34635826', 'C': '5200013', 'D': '8123'}
['290', '34635826', '5200013', '8123']
If you have too many titles, you have to query them in multiple requests, because at most 50 titles (500 for bots) can be queried at once.
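A rough way to handle that, reusing the same query as above and assuming the anonymous limit of 50 titles per request, is to send the titles in batches:
def get_page_ids(page_titles, batch_size=50):
    import requests
    title_to_page_id = {}
    # Query at most `batch_size` titles per request to stay under the API limit.
    for i in range(0, len(page_titles), batch_size):
        batch = page_titles[i:i + batch_size]
        url = (
            'https://en.wikipedia.org/w/api.php'
            '?action=query'
            '&prop=info'
            '&inprop=subjectid'
            '&titles=' + '|'.join(batch) +
            '&format=json')
        json_response = requests.get(url).json()
        for page_id, page_info in json_response['query']['pages'].items():
            title_to_page_id[page_info['title']] = page_id
    return title_to_page_id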
Upvotes: 5