Reputation: 542
I am searching for the total pageviews (from July 2015, the release date of the Pageviews API, to 1 January 2019) for every page of the French Wikipedia project.
Using the Pageviews API (How to use Wikipedia API to get the page view statistics of a particular page in wikipedia?) seems far too heavy to me: I need data for over 2 million pages.
Using MassViews (https://tools.wmflabs.org/massviews/) with a query returning all page titles (https://quarry.wmflabs.org/query/34473) does not work either: MassViews has a 20,000-page limit and fails to retrieve data for some page titles from my query results.
Do you know of a more efficient tool for this?
Upvotes: 0
Views: 798
Reputation: 166
Wikipedia's API is powerful; for example, a request like this can get the pageviews of Apollo_10 on French Wikipedia. Writing a script based on it is not hard.
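For illustration, here is a minimal Python sketch against the Pageviews REST API's per-article endpoint (the title, date range, and User-Agent string are placeholders to adapt):

import requests

# Per-article endpoint of the Wikimedia Pageviews REST API.
# Monthly granularity over the range asked about in the question.
url = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
    "fr.wikipedia/all-access/all-agents/Apollo_10/monthly/2015070100/2019010100"
)
# Wikimedia asks clients to send a descriptive User-Agent.
resp = requests.get(url, headers={"User-Agent": "pageviews-example-script"})
resp.raise_for_status()
# Sum the monthly counts to get the total over the whole range.
total = sum(item["views"] for item in resp.json()["items"])
print(total)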
If you think using the API to query all the pages is too heavy, you can use Google BigQuery. It has pageview data in its public datasets. There is a tutorial about this.
Here is my example:
select * from `bigquery-public-data.wikipedia.pageviews_2015` where datehour = '2015-07-12 18:00:00 UTC';
If you want a specific page of the French wiki, you can add wiki = 'fr' and title = 'xxx' to the WHERE clause. As I'm new to BigQuery, I don't know how to query across the tables and export, but based on my limited SQL knowledge that should be possible. You can aggregate the data by title and export the result, along the lines of the sketch below.
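A hedged sketch using the google-cloud-bigquery Python client (I am assuming the wiki, title, and views column names from the public dataset's schema; check them in the console before running):

from google.cloud import bigquery

# Requires GCP credentials and a billing-enabled project.
client = bigquery.Client()

# Sum the 2015 hourly counts per French Wikipedia title.
# This scans the wiki/title/views columns for the whole year,
# so check the estimated bytes billed first.
query = """
    SELECT title, SUM(views) AS total_views
    FROM `bigquery-public-data.wikipedia.pageviews_2015`
    WHERE wiki = 'fr'
    GROUP BY title
"""
for row in client.query(query).result():
    print(row.title, row.total_views)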
The only problem is that BigQuery is not free. For example, the SELECT * query above scans about 6 GB. On-demand queries are free for the first 1 TB per month and cost 5 dollars per TB after that. BigQuery charges according to the data processed in the columns you select, even if you use a LIMIT clause, so it may cost a lot.
Upvotes: 1
Reputation: 542
Found this: https://dumps.wikimedia.org/other/pagecounts-ez/merged/ which is a merge of the pageview dumps. Documented here: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews
Here is an example of a Python script that trivially prints each French Wikipedia line of one of the files:
import bz2
from pprint import pprint

# Each line of the merged dump looks like: "<project> <page_title> <counts...>"
with bz2.open("pagecounts-2011-12-views-ge-5-totals.bz2", "rt", errors="replace") as fichier:
    for line in fichier:
        text = line.split()
        # Keep only French Wikipedia entries
        if text and text[0] == "fr":
            pprint(text)
With this kind of file, one per month, it becomes easy to set up the following workflow: filter the lines I really want (French wiki), LOAD DATA INFILE them into a MySQL database, and query it with Python again. A rough sketch of the first two steps is below.
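This sketch assumes the merged-dump line format shown above and a hypothetical pageviews table in MySQL:

import bz2

# Filter the monthly merged dump down to French Wikipedia rows
# and write them as TSV, ready for LOAD DATA INFILE.
with bz2.open("pagecounts-2011-12-views-ge-5-totals.bz2", "rt", errors="replace") as src, \
     open("fr-2011-12.tsv", "w", encoding="utf-8") as dst:
    for line in src:
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "fr":
            # project, page title, monthly total
            dst.write("\t".join(fields[:3]) + "\n")

# Then, from the MySQL client (table and column names are hypothetical):
# LOAD DATA LOCAL INFILE 'fr-2011-12.tsv' INTO TABLE pageviews
#   FIELDS TERMINATED BY '\t' (project, title, views);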
Upvotes: 0
Reputation: 1721
You can download dumps of all pageviews from here: https://dumps.wikimedia.org/other/pageviews/
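For what it's worth, a small sketch of fetching and reading one of those hourly files (the URL pattern and the line format are inferred from the dump directory layout; verify them before downloading in bulk):

import gzip
import shutil
import urllib.request

# One gzipped file per hour; the path pattern is inferred from the
# directory layout at dumps.wikimedia.org/other/pageviews/.
url = ("https://dumps.wikimedia.org/other/pageviews/2015/2015-07/"
       "pageviews-20150701-000000.gz")
with urllib.request.urlopen(url) as resp, open("pageviews-20150701-000000.gz", "wb") as out:
    shutil.copyfileobj(resp, out)

# Each line is roughly "<project> <page_title> <view_count> <bytes>".
with gzip.open("pageviews-20150701-000000.gz", "rt", errors="replace") as f:
    for line in f:
        fields = line.split()
        if fields and fields[0] == "fr":
            print(fields[1], fields[2])
            break  # just show the first French Wikipedia entry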
Upvotes: 1