Léo Joubert

Reputation: 542

Getting total page view from (french) Wikipedia by page

I am searching for the total pageviews (from July 2015, the release date of the PageViews API, to 1 January 2019) of every page of the French Wikipedia project.

Using the PageViews API (How to use Wikipedia API to get the page view statistics of a particular page in wikipedia?) seems way too heavy to me: I need data for over 2 million pages.
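
For reference, here is a minimal sketch of what a single per-article call to the Wikimedia REST Pageviews API looks like in Python; looping it over 2 million titles is exactly what makes this approach so heavy:

import requests

# Monthly pageviews for one article on fr.wikipedia, July 2015 to January 2019.
URL = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
       "fr.wikipedia/all-access/all-agents/{title}/monthly/20150701/20190101")

resp = requests.get(URL.format(title="Apollo_10"),
                    headers={"User-Agent": "pageview-research-script"})
resp.raise_for_status()
print(sum(item["views"] for item in resp.json()["items"]))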

Using MassViews (https://tools.wmflabs.org/massviews/) with a query returning all page titles (https://quarry.wmflabs.org/query/34473) does not work either: MassViews suffers from a 20,000-page limitation, and fails to retrieve data for some of the page titles from my query results.

Do you know of more efficient tools to do this?

Upvotes: 0

Views: 798

Answers (3)

Timothy Wu

Reputation: 166

Wikipedia's API is powerful; for example, it can get the pageviews of Apollo_10 on the French Wikipedia. Writing a script based on it is not hard.

If you think using the API to query all the pages is too heavy, you can use Google BigQuery, which has pageview data in its public datasets. There is a tutorial about this.

Here is my example:

  1. Access BigQuery's console.
  2. Type the query below into the query editor:
select * from `bigquery-public-data.wikipedia.pageviews_2015` where datehour = '2015-07-12 18:00:00 UTC';
  3. You will get a table that contains all the pageview data for that hour.

If you want a specific page of the French wiki, you can filter on wiki = 'fr' and title = 'xxx'. As I'm new to BigQuery, I don't know how to query data across tables and export it, but based on my limited SQL knowledge it should be possible: you can aggregate the data by title and export the result.
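
A minimal sketch of that aggregation with the google-cloud-bigquery Python client (assuming the public table's columns are datehour, wiki, title and views; you would repeat this per yearly table or use a table wildcard):

from google.cloud import bigquery

# Sum 2015 pageviews per title for French Wikipedia.
# The column names (wiki, title, views) are assumptions about the dataset.
client = bigquery.Client()
sql = """
    SELECT title, SUM(views) AS total_views
    FROM `bigquery-public-data.wikipedia.pageviews_2015`
    WHERE wiki = 'fr'
    GROUP BY title
"""
for row in client.query(sql).result():
    print(row.title, row.total_views)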

The only problem is that BigQuery is not free. For example, the query above processes 6 GB. On-demand queries are free for the first 1 TB per month and cost 5 dollars per TB after that. BigQuery charges according to the data processed in the columns you select, even if you use a LIMIT, so it may cost a lot.

Upvotes: 1

Léo Joubert

Reputation: 542

Found this: https://dumps.wikimedia.org/other/pagecounts-ez/merged/ which is a merged set of page view dumps, documented here: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews

Here is an example of a Python script that trivially prints each French Wikipedia line of one of the files:

import bz2
from pprint import pprint

# Stream the compressed dump line by line and keep only the entries
# whose first column is the project code "fr" (French Wikipedia).
with bz2.open("pagecounts-2011-12-views-ge-5-totals.bz2", "rt", errors="replace") as fichier:
    for line in fichier:
        text = line.split()
        if text[0] == "fr":
            pprint(text)

With this kind of file, one per month, it becomes easy to set up the following workflow: filter the lines I really want (French wiki), LOAD DATA INFILE into a MySQL database, and query it with Python again. A sketch follows below.
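
A minimal sketch of that workflow, assuming the first three space-separated columns of a merged file are project, title and view count (the output file and table names here are only illustrative):

import bz2

# Step 1: filter the French-Wikipedia lines into a tab-separated file
# that MySQL can bulk-load.
with bz2.open("pagecounts-2011-12-views-ge-5-totals.bz2", "rt", errors="replace") as src, \
        open("fr_pageviews.tsv", "w") as out:
    for line in src:
        fields = line.split()
        if fields and fields[0] == "fr":
            out.write("\t".join(fields[:3]) + "\n")

# Step 2: bulk-load into MySQL, e.g.:
#   LOAD DATA INFILE 'fr_pageviews.tsv'
#   INTO TABLE pageviews
#   FIELDS TERMINATED BY '\t'
#   (project, title, views);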

Upvotes: 0

smartse

Reputation: 1721

You can download dumps of all pageviews from here: https://dumps.wikimedia.org/other/pageviews/
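
For example, a minimal sketch of pulling one hourly file and keeping the French-Wikipedia lines (the file name follows the naming pattern used on that page, and the four-column line layout of domain code, page title, view count and response size is my reading of the dump format):

import gzip
import urllib.request

# One hourly dump; file names follow the pattern pageviews-YYYYMMDD-HH0000.gz.
url = ("https://dumps.wikimedia.org/other/pageviews/"
       "2019/2019-01/pageviews-20190101-000000.gz")
urllib.request.urlretrieve(url, "pageviews.gz")

# Domain code "fr" marks the French Wikipedia.
with gzip.open("pageviews.gz", "rt", errors="replace") as f:
    for line in f:
        fields = line.split()
        if fields and fields[0] == "fr":
            print(fields[1], fields[2])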

Upvotes: 1
