Reputation: 31
I have seen that there are various APIs and tools that let you see the most visited pages of Wikimedia projects such as Wikipedia, but all of these services have a limit: they do not show more than 1000 pages, while I would like a list of the 5000-10000 (or more) most visited pages, ordered by traffic.
These are the services I checked, all of which have this limit:
https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bmostviewed
https://wikimedia.org/api/rest_v1/#/Pageviews%20data
I have also found services like https://quarry.wmflabs.org/ and https://query.wikidata.org/ where you can run a query. Technically it might be possible through one of these, but I don't know what query would return the most visited pages.
I also found an interesting article here: https://www.reddit.com/r/bigquery/comments/3dg9le/analyzing_50_billion_wikipedia_pageviews_in_5/ which explains that Google's BigQuery can be used for this, but it is an external service, and before using it I wanted to know whether a simpler method exists.
Upvotes: 0
Views: 1182
Reputation: 436
If the REST API doesn't suit your purpose, you'd need to parse the raw data yourself. That's because all the tools you've linked just consume the REST API.
The raw data are available at https://dumps.wikimedia.org/other/pageviews/. There are two groups of files there. Files whose names start with pageviews- list the number of views of individual pages; files whose names start with projectviews- list the number of views of individual projects.
For your purpose, you need the pageviews files. Download the files for your timespan, and then analyze them using a script.
Each file is space-separated, and each row represents one page that was visited in that hour. The first column is the project (en is English Wikipedia, for instance), the second is the page title (spaces are represented by underscores), and the third is the number of pageviews. Because each file only covers one hour, you have to sum the counts for each page across all the files in your timespan before sorting.
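As a rough sketch of such a script (not a polished tool), something like the following Python could work. It assumes the files you downloaded are gzip-compressed and follow the column layout described above; the function names count_views and top_pages are my own, and any extra trailing field on a row is simply ignored.

```python
import gzip
import sys
from collections import Counter


def count_views(lines, project="en"):
    """Sum the view counts per page title for a single project code."""
    totals = Counter()
    for line in lines:
        fields = line.split()
        # Expect at least: project, page title, view count.
        if len(fields) < 3 or fields[0] != project:
            continue
        try:
            totals[fields[1]] += int(fields[2])
        except ValueError:
            # Skip rows whose count column is not a number.
            continue
    return totals


def top_pages(paths, project="en", limit=10000):
    """Aggregate several hourly files and return the most viewed pages."""
    totals = Counter()
    for path in paths:
        # Assumption: files are gzip-compressed UTF-8 text.
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
            totals.update(count_views(handle, project))
    return totals.most_common(limit)


if __name__ == "__main__":
    # Usage: python top_pages.py file1.gz file2.gz ...
    for title, views in top_pages(sys.argv[1:]):
        print(f"{views}\t{title}")
```

This keeps one counter entry per distinct title in memory, which should be acceptable for a single project but may need rethinking for very long timespans. Change the project argument to target another wiki, and raise limit if you want more than 10000 rows.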
The technical documentation is available at https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews.
Upvotes: 2