OmniOwl

Reputation: 5709

How to get specific data from Wikipedia?

I only want to get data about video games (such as Duke Nukem 3D, Atari Games, etc.). But looking at how Wikipedia expects you to query and how others have done it, I can't quite wrap my head around how to do it. I've searched for a couple of days now but have come up empty-handed.

I also had a look at their API, but the way they want you to build a query string didn't really help me. I tried this:

https://en.wikipedia.org/w/api.php?action=query&prop=categories&format=json&titles=Video_Game

But it gave me this in return:

{
   "batchcomplete":"",
   "query":{
      "normalized":[
         {
            "from":"Video_Game",
            "to":"Video Game"
         }
      ],
      "pages":{
         "361741":{
            "pageid":361741,
            "ns":0,
            "title":"Video Game",
            "categories":[
               {
                  "ns":14,
                  "title":"Category:Redirects from other capitalisations"
               },
               {
                  "ns":14,
                  "title":"Category:Unprintworthy redirects"
               }
            ]
         }
      }
   }
}

I suspect it just found me the page for what a video game is, not all pages that are about video games. I might just not understand how to get data from Wikipedia correctly.

Any help?

Upvotes: 0

Views: 291

Answers (2)

Termininja

Reputation: 7036

All pages in English Wikipedia about video games contain a template called Infobox video game, so you just need to use a Wikipedia API query with the transcludedin property to get all of them:

https://en.wikipedia.org/w/api.php?action=query&prop=transcludedin&tilimit=500&titles=Template:Infobox_video_game
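
Since tilimit=500 returns at most 500 pages per request, getting the full list means following the continuation values the API sends back. Here is a minimal Python sketch of that loop, not a definitive implementation. The function names, the injectable fetch argument and the User-Agent string are my own assumptions for illustration; tinamespace=0 is added to restrict results to articles.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"
# Hypothetical User-Agent; replace it with your own tool name and contact.
USER_AGENT = "VideoGameListSketch/0.1 (you@example.org)"


def fetch_json(params):
    """Send one GET request to the API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


def pages_using_template(template, fetch=fetch_json):
    """Yield titles of articles that transclude `template`."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "transcludedin",
        "titles": template,
        "tinamespace": "0",  # main (article) namespace only
        "tilimit": "500",
    }
    while True:
        data = fetch(params)
        for page in data.get("query", {}).get("pages", {}).values():
            for item in page.get("transcludedin", []):
                yield item["title"]
        if "continue" not in data:
            break
        # Copy every continuation value into the next request.
        params = {**params, **data["continue"]}
```

Calling list(pages_using_template("Template:Infobox video game")) would then walk through every batch; expect it to take a while, since the template is used on a lot of pages.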

Upvotes: 1

Tgr

Reputation: 28160

For one thing, Video Game is a redirect to Video game (capitalization matters in Wikipedia, except for the very first character of the title) so it does not have much useful information. You can use the redirects=1 API parameter to automatically resolve redirects.
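
As a rough Python sketch of what that looks like: the first function builds the question's query with redirects=1 added, and the second reads the "normalized" and "redirects" lists from the reply to see where the API ended up. Both function names are made up for illustration.

```python
import urllib.parse

API_URL = "https://en.wikipedia.org/w/api.php"


def categories_url(title):
    """Build the categories query from the question, with redirects resolved."""
    params = {
        "action": "query",
        "prop": "categories",
        "format": "json",
        "redirects": "1",
        "titles": title,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def resolved_title(response, title):
    """Follow the 'normalized' and 'redirects' entries to the final title."""
    query = response.get("query", {})
    for step in ("normalized", "redirects"):
        for entry in query.get(step, []):
            if entry["from"] == title:
                title = entry["to"]
    return title
```

With redirects=1 the categories in the reply belong to the redirect target rather than to the redirect page itself.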

Another problem is that you are asking the API what categories the video game article is in. What you probably wanted is to ask what articles are in the video game category. That's something like action=query&list=categorymembers&cmtitle=Category%3AVideo+games.
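
A small Python sketch of that query, assuming you fetch the URL yourself and pass the decoded JSON to the second function. The function names are hypothetical; cmlimit=500 and format=json are additions to the parameters above.

```python
import urllib.parse

API_URL = "https://en.wikipedia.org/w/api.php"


def category_members_url(category, limit=500):
    """Build a list=categorymembers query for one category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": str(limit),
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def split_members(response):
    """Separate articles (namespace 0) from subcategories (namespace 14)."""
    articles, subcategories = [], []
    for member in response.get("query", {}).get("categorymembers", []):
        if member["ns"] == 14:
            subcategories.append(member["title"])
        elif member["ns"] == 0:
            articles.append(member["title"])
    return articles, subcategories
```

category_members_url("Category:Video games") encodes the title as Category%3AVideo+games, the same form as in the query string above.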

The third problem is that categories form a graph, so usually most of the relevant content is in subcategories and not the main category itself. That is the case with Category:Video games as well. So even if you had retrieved the article list correctly, it wouldn't have been particularly useful.
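
If you do want to descend into subcategories yourself, a walk over the graph needs a visited set and some depth cap. The following Python sketch shows the idea only; get_members is a hypothetical callable you would supply (for example a wrapper around list=categorymembers) that returns entries with "ns" and "title" keys, and max_depth=2 is an arbitrary default.

```python
from collections import deque


def walk_categories(root, get_members, max_depth=2):
    """Breadth-first walk; returns article titles found within max_depth levels."""
    seen = {root}            # categories already queued, so loops are harmless
    articles = set()
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        for member in get_members(category):
            title = member["title"]
            if member["ns"] == 14:  # subcategory
                if depth < max_depth and title not in seen:
                    seen.add(title)
                    queue.append((title, depth + 1))
            elif member["ns"] == 0:  # article
                articles.add(title)
    return articles
```

The visited set keeps the walk from revisiting a category, and the depth cap limits how far it can wander from the starting category.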

There are various ways to get a more useful list of relevant articles.

  • You can query based on infobox inclusion as Termininja said. The problem with that is that it will miss articles which don't have the infobox (generally newer, less well-written ones).
  • You can use the experimental, standalone category graph search service to find all articles in the video games category and subcategories. In practice that can be dangerous with very generic categories such as "video games" because the category system is messy - it's not a proper tree, it contains loops and other weird things, so you might find that when going deep enough into subcategories the content is not even remotely related to video games. (For example video games > video game culture > nerd culture > anime and manga fandom.)
  • You can use SPARQL queries against Wikidata, for example selecting items whose "instance of" (P31) value is "video game" (Q7889). This depends on the Wikidata information being properly maintained, which is not always the case.
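
For the Wikidata option, here is a sketch of what such a query could look like from Python, assuming the public endpoint at query.wikidata.org. It only matches items that are direct instances of video game (Q7889) and keeps those with an English Wikipedia article. The function names and the LIMIT 100 are illustrative choices, not anything prescribed by Wikidata.

```python
import urllib.parse

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Items that are an instance of (P31) video game (Q7889) and have an
# English Wikipedia article.
QUERY = """
SELECT ?item ?article WHERE {
  ?item wdt:P31 wd:Q7889 .
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> .
}
LIMIT 100
"""


def sparql_url(query=QUERY):
    """Build a GET URL that asks the endpoint for JSON results."""
    params = {"query": query, "format": "json"}
    return SPARQL_ENDPOINT + "?" + urllib.parse.urlencode(params)


def titles_from_results(results):
    """Turn the ?article URLs in a JSON result into page titles."""
    titles = []
    for row in results["results"]["bindings"]:
        url = row["article"]["value"]
        name = url.rsplit("/wiki/", 1)[-1]
        titles.append(urllib.parse.unquote(name).replace("_", " "))
    return titles
```

Fetch sparql_url() with any HTTP client (send a descriptive User-Agent), decode the JSON, and pass it to titles_from_results.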

You are probably best off using the infobox in this case.

Upvotes: 0
