Binyamin Even

Reputation: 3382

Getting the creation date of a page on Wikidata Query Service

I'm looking for a way to get all the movies on English Wikipedia, with their creation date. A movie for me is a page with IMDB ID attached to it. So, this is my query so far:

SELECT DISTINCT ?item_label ?imdb_id (YEAR(?dateCreation) AS ?AnneeCreation) WHERE {
  ?item wdt:P345 ?imdb_id.
  FILTER STRSTARTS(?imdb_id, "tt")
  OPTIONAL {
    ?item wdt:P571 ?dateCreation.
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> ;
           schema:name ?item_label.
}

The problem with this is that most of the pages don't have a P571 property, so I was wondering whether there is a better way to get the creation date. Maybe from the revision history or something? I couldn't find such an option.

Any help will be appreciated!

Upvotes: 6

Views: 1180

Answers (1)

Andrew is gone

Reputation: 296

So, as the comments have noted, Wikidata properties (with some rare exceptions, like featured-article flags) describe the underlying concept, not the Wikipedia page metadata. There is some limited ability to talk to the Wikipedia API, as @AKSW points out, but my understanding is that this doesn't work well for large numbers of articles (note that the example code has a LIMIT 50 in it).

However, all is not lost! I recently worked out a methodology to do this at scale for very large numbers of articles in Gender and Deletion on Wikipedia, using a bit of lateral thinking.

First step: figure out your Wikidata query. tt-prefixed IMDB tags may apply to things other than films (e.g. TV episodes, sports broadcasts), so another approach is a P31/P279 type/class search to find all things that are "films, or subclasses of films". You will also want a filter that explicitly says "and only has an article in English Wikipedia", which I see you've already done. Note that this gives you the name of the WP article, not the "label" of the Wikidata item, which is distinct, so you can drop the (time-consuming) label service clause. You'll end up with something like https://w.wiki/FH4 (this still uses the tt- prefix approach and gets ~180k results) or https://w.wiki/FH8 (P31/P279 filter plus tt- prefix, ~136k results).

Run this query, save the results TSV somewhere, and move on to step 2. The tool we will use here is PetScan, which is designed to link up data from Wikipedia categories, Wikipedia metadata, Wikidata queries, etc.

Feed the SPARQL query into tab 4 ("Other sources") and say "Use wiki: enwiki" at the bottom of this tab. This will force it to output data on the Wikipedia articles linked from this query.

Now hit "do it", wait a little while (it took ~100s when I tested it), and examine the results. You will see that we get title (the WP article), page ID, namespace (hopefully always "(Article)"), size in bytes, and last-touched date. None of these are creation date...

...except one of them kind of is. PageIDs are assigned sequentially, so they are essentially time-of-creation timestamps. There are some nuances around edge cases: e.g. if I created a redirect called "Example (film)" in 2010, and in 2015 manually edited that redirect into a real article called "Example (film)", it would show up as created in 2010. There may also be odd results for pages deleted and recreated, or ones with complicated page-move histories (straightforward page moves preserve IDs, though). But in general, for 95% of items, the pageID will reflect the time at which the page was first created on-wiki. For example, pageID 431900000 was created at 11:14am on 1 July 2014; 531900000 at 6:29pm on 14 February 2017; and so on.
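You can verify the creation time behind any individual pageID with the MediaWiki API: `prop=revisions` with `rvdir=newer` and `rvlimit=1` returns the oldest revision in the page's history. A minimal Python sketch (the caveats above still apply — the oldest surviving revision of a redirect-turned-article will be the redirect's creation):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"

def first_revision_url(pageid):
    """Build a MediaWiki API query for a page's earliest revision timestamp."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvlimit": 1,
        "rvdir": "newer",       # oldest revision first
        "rvprop": "timestamp",
        "pageids": pageid,
        "format": "json",
    }
    return API + "?" + urlencode(params)

def creation_timestamp(pageid):
    """Fetch the timestamp of a page's first surviving revision (network call)."""
    with urlopen(first_revision_url(pageid)) as resp:
        data = json.load(resp)
    page = data["query"]["pages"][str(pageid)]
    return page["revisions"][0]["timestamp"]
```

This is fine for spot-checking individual pages, but don't loop it over 180k pageIDs — that's exactly the scale problem the PetScan approach avoids.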

Back to PetScan - let's pull down all these items. In PetScan, go to the last tab and select TSV. Re-run the search and save the resulting file.

Now, we have one TSV with Wikidata IDs, IMDB IDs, and WP page titles (plus anything else you want to recover from WD queries); we have another with WP page titles and page IDs. You can link them together using WP page titles, letting you go from "results in Wikidata" to "page ID". Clean these up and link them however you prefer; I did it in bash, but you might want to use something more sensible like Python.
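The linking step might look like the sketch below in Python. The column names (`article`, `title`, `pageid`) are assumptions — check the headers of the TSVs you actually downloaded — and note that PetScan titles use underscores where SPARQL results use spaces, so normalize before matching:

```python
import csv
from io import StringIO

def join_on_title(wikidata_tsv, petscan_tsv,
                  title_col_a="article", title_col_b="title"):
    """Join two TSV strings on the Wikipedia page title.

    Column names are assumptions; adjust title_col_a/title_col_b to
    match the headers in your actual files.
    """
    def norm(title):
        # PetScan uses underscores, SPARQL output uses spaces
        return title.strip().replace(" ", "_")

    b_index = {norm(row[title_col_b]): row
               for row in csv.DictReader(StringIO(petscan_tsv), delimiter="\t")}
    joined = []
    for row in csv.DictReader(StringIO(wikidata_tsv), delimiter="\t"):
        match = b_index.get(norm(row[title_col_a]))
        if match:  # silently drop titles missing from the PetScan output
            joined.append({**row, **match})
    return joined
```

For 180k rows this runs in a second or two; for anything fancier (keeping unmatched rows, deduplicating), a pandas merge would be the obvious upgrade.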

Now you can convert pageID to creation date. For the work I did, I was only interested in six-month bins, so I just worked out an arbitrary pageID created on 1 January and 1 July of each year, and counted IDs between them. You could do the same thing, or use the API to look up individual pageIDs and get creation timestamps back; it depends on exactly what you want to get.
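The binning approach reduces to a sorted-anchor lookup. A minimal sketch, seeded with the two pageID/date pairs mentioned above (real use needs one anchor per bin boundary, which you'd look up manually or via the API):

```python
import bisect
from datetime import date

# (pageid, creation date) anchors, sorted by pageid.  Only the two
# data points from this answer are filled in; add one per bin boundary.
ANCHORS = [
    (431_900_000, date(2014, 7, 1)),
    (531_900_000, date(2017, 2, 14)),
]

def bin_for_pageid(pageid):
    """Return the bin date a pageID falls into, or None if it
    predates the earliest anchor."""
    ids = [a for a, _ in ANCHORS]
    i = bisect.bisect_right(ids, pageid) - 1
    return ANCHORS[i][1] if i >= 0 else None
```

With dated anchors every six months, this classifies millions of pageIDs without a single API call.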

This is all a bit more complicated than just using the query service, and it will probably give spurious results for one or two articles with complicated histories, but it will basically let you do what you originally asked for.

Upvotes: 3
