AllSmiles

Reputation: 73

How to get page id from wikipedia page title

I am trying to find the page ids for a list of Wikipedia pages. The format is:

input: list of Wikipedia page titles

output: list of Wikipedia page ids

So far I've gone through the MediaWiki API documentation, but I couldn't find a correct way to implement this. Can anyone suggest how to get the list of page ids?

Upvotes: 3

Views: 2902

Answers (3)

sv_jan5

Reputation: 1573

Querying the Wikipedia API for the mapping can be time-consuming, given the restrictions on its usage.

It may be better to download the Wikipedia dump and use wikiextractor to convert it to JSON. In that output, the key id holds the Wikipedia page id and the key title holds the page title, so you get the mapping for every page in Wikipedia in one go.
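A minimal sketch of building that mapping, assuming wikiextractor was run with its --json option so that each output line is one JSON object containing id and title keys. The function name and the sample records are made up for illustration; in practice the lines would come from the extracted files.

```python
import json
from typing import Dict, Iterable


def title_to_id_from_json_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Build a title -> page id mapping from JSON-lines records."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        mapping[record["title"]] = record["id"]
    return mapping


# Illustrative records only; real ids and titles come from the dump.
sample_lines = [
    '{"id": "1", "title": "Example page A", "text": "..."}',
    '{"id": "2", "title": "Example page B", "text": "..."}',
]
sample_mapping = title_to_id_from_json_lines(sample_lines)
print(sample_mapping)
```

To process a real extraction, the same function can be called on each opened output file, since a file object iterates over its lines.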

Upvotes: 1

Gianmario Spacagna

Reputation: 1300

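The answer provided by AXO works as long as your titles are already normalized and contain no special characters. It breaks for unnormalized titles such as the category page "Category:Computer_storage_devices", or for titles containing characters like &.

In those cases you need to URL-encode the titles and map the response back to your input through the normalized titles, as follows: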

import requests


def get_page_ids(page_titles):
    # Percent-encode each title so characters such as '&' do not break the query string.
    page_titles_encoded = [requests.utils.quote(x) for x in page_titles]

    url = (
        'https://en.wikipedia.org/w/api.php'
        '?action=query'
        '&prop=info'
        '&inprop=subjectid'
        '&titles=' + '|'.join(page_titles_encoded) +
        '&format=json')
    json_response = requests.get(url).json()

    # Map each normalized title back to the title that was passed in.
    # Titles that need no normalization map to themselves.
    page_normalized_titles = {x: x for x in page_titles}
    if 'normalized' in json_response['query']:
        for mapping in json_response['query']['normalized']:
            page_normalized_titles[mapping['to']] = mapping['from']

    # The response is keyed by page id; each entry carries the normalized title.
    result = {}
    for page_id, page_info in json_response['query']['pages'].items():
        normalized_title = page_info['title']
        page_title = page_normalized_titles[normalized_title]
        result[page_title] = page_id

    return result


print(get_page_ids(page_titles=['Category:R&J_Records_artists', 'Category:Computer_storage_devices', 'Category:Main_topic_classifications']))

will print

{'Category:R&J_Records_artists': '33352333', 'Category:Computer_storage_devices': '895945', 'Category:Main_topic_classifications': '7345184'}
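As an alternative to quoting each title by hand, the whole query string can be built by a URL encoder. The sketch below uses urllib.parse.urlencode from the standard library; build_query_url is a hypothetical helper name, not part of either answer, and it only constructs the URL without sending a request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_ENDPOINT = 'https://en.wikipedia.org/w/api.php'


def build_query_url(page_titles):
    """Return the API URL with all parameters percent-encoded."""
    params = {
        'action': 'query',
        'prop': 'info',
        'inprop': 'subjectid',
        'titles': '|'.join(page_titles),
        'format': 'json',
    }
    return API_ENDPOINT + '?' + urlencode(params)


example_url = build_query_url(
    ['Category:R&J_Records_artists', 'Category:Computer_storage_devices'])

# Decoding the query string again recovers the original titles intact.
decoded_titles = parse_qs(urlsplit(example_url).query)['titles'][0]
print(decoded_titles.split('|'))
```

Because the & inside the title is encoded as %26, it is no longer mistaken for a parameter separator.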

Upvotes: 0

AXO

Reputation: 9086

Query basic page information:

import requests

page_titles = ['A', 'B', 'C', 'D']
url = (
    'https://en.wikipedia.org/w/api.php'
    '?action=query'
    '&prop=info'
    '&inprop=subjectid'
    '&titles=' + '|'.join(page_titles) +
    '&format=json')
json_response = requests.get(url).json()

title_to_page_id = {
    page_info['title']: page_id
    for page_id, page_info in json_response['query']['pages'].items()}

print(title_to_page_id)
print([title_to_page_id[title] for title in page_titles])

This will print:

{'A': '290', 'B': '34635826', 'C': '5200013', 'D': '8123'}
['290', '34635826', '5200013', '8123']

If you have many titles, you have to split them across multiple requests, because at most 50 titles (500 for bots) can be queried at once.
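One way to do that batching is sketched below. It assumes a callable fetch that takes a list of titles and returns a title-to-id dict (for example, the query above wrapped in a function); both chunked and get_page_ids_in_batches are hypothetical helper names.

```python
from typing import Callable, Dict, Iterator, List


def chunked(titles: List[str], size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` titles."""
    for start in range(0, len(titles), size):
        yield titles[start:start + size]


def get_page_ids_in_batches(
        titles: List[str],
        fetch: Callable[[List[str]], Dict[str, str]],
        size: int = 50) -> Dict[str, str]:
    """Call `fetch` once per batch and merge the results into one dict."""
    result: Dict[str, str] = {}
    for batch in chunked(titles, size):
        result.update(fetch(batch))
    return result


# Batch sizes for 120 placeholder titles with the default limit of 50.
placeholder_titles = ['T%d' % i for i in range(120)]
batch_sizes = [len(batch) for batch in chunked(placeholder_titles)]
print(batch_sizes)
```

Passing size=500 would match the higher limit for bot accounts.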

Upvotes: 5
