Hani Goc

Reputation: 2441

How can I format or normalize a Wikipedia title in order to get its Wikipedia page ID (Python)?

Introduction

As input we have a Wikipedia page title, and we want to retrieve the corresponding Wikipedia page ID. For this purpose I am using the following Python code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests

if __name__ == "__main__":
    # Query the MediaWiki API for the page matching the given title.
    url = "https://en.wikipedia.org/w/api.php"
    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "exintro": "",
        "explaintext": "",
        "titles": "Daniel cudmore businessman",
    }
    result = requests.get(url, params=params).json()
    print(result)

Problem

I can't find the Wikipedia page IDs for the following titles:

1- daniel cudmore businessman

{u'batchcomplete': u'', u'query': {u'pages': {u'-1': {u'ns': 0, u'missing': u'', u'title': u'Daniel cudmore businessman'}}}}

The actual id of the page should be: 37030093

The problem here is that the actual Wikipedia page title is Daniel Cudmore (businessman), whereas mine has the form daniel cudmore businessman.


2- Prince david of georgia:

{u'batchcomplete': u'', u'query': {u'normalized': [{u'to': u'Prince david of georgia', u'from': u'prince david of georgia'}], u'pages': {u'-1': {u'ns': 0, u'missing': u'', u'title': u'Prince david of georgia'}}}}

The actual id of the page should be: 3443932

Here the title of the Wikipedia page and the title that I used look the same to me, so I can't find the problem.


For comparison, this is the query on the DBpedia SPARQL endpoint:

SELECT ?id WHERE { 
     <http://dbpedia.org/resource/Daniel_Cudmore_(businessman)>  
     <http://dbpedia.org/ontology/wikiPageID> ?id}

SPARQL results

Upvotes: 1

Views: 305

Answers (1)

DomTomCat

Reputation: 8569

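In the latter example ("Prince_david_of_georgia"), the letter case differs from the real title (compare with "Prince_David_of_Georgia"), so that exact page doesn't exist on Wikipedia either. MediaWiki titles are case-sensitive apart from the first character, which is why the API only normalized "prince" to "Prince" in your output.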

You could use the Special:Search URL https://en.wikipedia.org/wiki/Special:Search/Prince_david_of_georgia to get the requested page and then retrieve the ID,

or get a list of suggestions from https://en.wikipedia.org/wiki/Special:Search/Daniel_Cudmore_businessman, which you can parse for the first entry. This will likely be your page. To double-check, compare the strings after removing whitespace, parentheses, etc., then retrieve the ID as you already did.
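A rough sketch of that idea, with one swap: instead of parsing the Special:Search HTML page, it asks the MediaWiki API for search results (action=query with list=search), which already carry a pageid for each hit. The helper names normalize_title, pick_page_id and find_page_id are made up for this example, and the comparison rule (lowercase, keep only letters and digits) is just one possible choice, so treat it as a starting point rather than a finished solution.

```python
import re

API_URL = "https://en.wikipedia.org/w/api.php"


def normalize_title(title):
    """Lowercase the title and drop whitespace, punctuation and underscores."""
    return re.sub(r"[\W_]+", "", title.lower())


def pick_page_id(search_hits, wanted_title):
    """Return the pageid of the first hit whose normalized title matches.

    search_hits is a list of dicts with "title" and "pageid" keys, as found
    under query -> search in the API response. Returns None if nothing matches.
    """
    wanted = normalize_title(wanted_title)
    for hit in search_hits:
        if normalize_title(hit["title"]) == wanted:
            return hit["pageid"]
    return None


def find_page_id(title, limit=5):
    """Search Wikipedia for the title and return the matching page ID, or None."""
    # Imported here so the two pure helpers above work without the dependency.
    import requests

    params = {
        "action": "query",
        "list": "search",
        "srsearch": title,
        "srlimit": limit,
        "format": "json",
    }
    response = requests.get(API_URL, params=params, timeout=10)
    response.raise_for_status()
    hits = response.json().get("query", {}).get("search", [])
    return pick_page_id(hits, title)


if __name__ == "__main__":
    for title in ("daniel cudmore businessman", "prince david of georgia"):
        print(title, "->", find_page_id(title))
```

The string comparison is deliberately strict: if no search hit matches after normalization, the function returns None instead of guessing, so you can decide yourself whether to fall back to the first hit.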

Upvotes: 1
