Kenzo

Reputation: 357

Where / how to get the top 10,000 Wikipedia article titles, by pageview?

I would like to get the top ~10,000 Wikipedia article titles by page views in the English Wiki project.

I do not need the pageviews to come with the data. I just need to know that I have the top 10,000 article titles.

A list of the top 10,000 would be great, as I can use that to scrape. A JSON of the top X would be even better!

Topviews and Massviews have been great resources, and are oh-so-close to what I'm looking for!

Topviews, however, limits the list to 490, and Massviews requires a search term. I would like the most popular Wiki articles across the whole English project.

I am open to data dumps, APIs, or any other existing tool. Appreciate the help, Wikis!

Upvotes: 0

Views: 1228

Answers (1)

Kenzo

Reputation: 357

Here is the response to my question above from the creator of the Massviews/Topviews tools, the wonderful Mr. Leon Ziemba:

I'm not sure what you mean by "all categories". Do you mean all articles, across all of a project? There is https://tools.wmflabs.org/topviews, if that helps.

Or do you mean you want to give Massviews several categories at once? If so, a workaround would be to use a combination of Petscan, PagePile, then Massviews:

* Go to https://petscan.wmflabs.org/ and add your categories, selecting "union" as the "Combination", then hit "Do it!".
* Click on the "Output" tab at the top-right and select "PagePile" as the Format. Other options can probably be left as-is. Click "Do it!" once more.
* You should now be on the PagePile page. At the top-left it will say "Pile 123", where 123 is the pile number. Take note of this.
* Go back to Massviews, select "Page Pile" as the source, and put in the pile number.
* Profit!
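(Not part of Leon's reply: if you'd rather script the Petscan step than click through the UI, Petscan also answers plain HTTP requests. Here is a minimal Python sketch; the category names are placeholders, and the parameter names and JSON path reflect Petscan's `format=json` output as I understand it, so inspect the raw response if the layout has changed.)

```python
# Hedged sketch of driving Petscan from a script instead of the UI.
# The parameters mirror the manual steps above; the example categories
# are placeholders, and the JSON path below may need adjusting if
# Petscan's output format changes.
import requests

params = {
    "language": "en",
    "project": "wikipedia",
    "categories": "Physics\nChemistry",  # one category per line
    "combination": "union",              # same as "union" in the UI
    "format": "json",
    "doit": "1",                         # equivalent of clicking "Do it!"
}
r = requests.get("https://petscan.wmflabs.org/", params=params, timeout=120)
r.raise_for_status()
data = r.json()

# Page records appear under data["*"][0]["a"]["*"] in the JSON output;
# verify against the raw response before relying on this path.
pages = data["*"][0]["a"]["*"]
titles = [p["title"] for p in pages]
print(len(titles), "pages;", titles[:5])
```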

In Massviews, there is an option "Include all subcategories". Maybe that would help you here. However, for performance reasons, you'll get no more than 20,000 results.

If you need the top 10,000 pages by pageviews across all of the English Wikipedia, this will have to be computed manually from the raw datasets. It would not be feasible for a tool to go through every single Wikipedia article in real time. The raw dataset dumps can be found at https://dumps.wikimedia.org/other/pageviews/.
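(To make that "compute it manually" step concrete, here is a rough Python sketch of my own, not Leon's: it pulls one hourly dump and tallies the 10,000 most-viewed English Wikipedia titles. Each line in those files is `domain_code page_title count_views total_response_size`; the URL below is just one arbitrary example hour.)

```python
# Sketch: stream one hourly pageviews dump from dumps.wikimedia.org and
# keep the 10,000 most-viewed English Wikipedia titles.
import gzip
import shutil
import urllib.request
from collections import Counter

URL = ("https://dumps.wikimedia.org/other/pageviews/"
       "2023/2023-01/pageviews-20230101-000000.gz")  # example hour only

# Download the hourly dump (roughly tens of MB compressed).
with urllib.request.urlopen(URL) as resp, open("pageviews.gz", "wb") as f:
    shutil.copyfileobj(resp, f)

counts = Counter()
with gzip.open("pageviews.gz", "rt", encoding="utf-8", errors="replace") as f:
    for line in f:
        parts = line.split(" ")
        if len(parts) != 4:  # skip malformed lines
            continue
        domain, title, views = parts[0], parts[1], parts[2]
        # "en" = English Wikipedia desktop, "en.m" = mobile web.
        if domain in ("en", "en.m"):
            counts[title] += int(views)

top = counts.most_common(10_000)
for title, views in top[:10]:  # preview the head of the list
    print(views, title)
```

One hour is a noisy sample, so for a meaningful top 10,000 you would run the same loop over many hourly files (say, a full month) and sum the counters before taking `most_common`.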

Upvotes: 1
