JohnnyDeer

Reputation: 231

Working with large *.bz2 (Wikipedia dump)

I need to get the number of daily pageviews of the English Wikipedia articles on "Dollar" and "Euro" from 06/2012 to 06/2016.

Raw dumps (*.bz2) are available at: https://dumps.wikimedia.org/other/pagecounts-ez/merged/

For example, https://dumps.wikimedia.org/other/pagecounts-ez/merged/pagecounts-2014-01-views-ge-5-totals.bz2 provides hourly/daily data for January 2014.

Problem: The decompressed files are too large to open in any text editor.

Desired solution: A Python script (?) that reads each of the .bz2 files, searches only for the English Wikipedia "Dollar" / "Euro" entries, and puts the daily pageviews into a DataFrame.

Hint: Using the Pageviews API (https://wikitech.wikimedia.org/wiki/Pageviews_API) won't help, as I need consistent data from before 2015. The stats.grok.se data (http://stats.grok.se/) is not an option either, as the data it generates is different and incompatible.

Upvotes: 1

Views: 1275

Answers (1)

Ilmari Karonen

Reputation: 50338

Probably the simplest solution would be to write your search script so that it reads line by line from standard input (sys.stdin in Python; of course there's a Stack Overflow question about that too) and then pipe the output of bzcat into it:

$ bzcat pagecounts-2014-01-views-ge-5-totals.bz2 | python my_search.py

Just make sure that your Python code indeed processes the input incrementally, rather than trying to buffer the entire input in memory at once.
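For illustration, here's a minimal sketch of what such a my_search.py could look like. It assumes the lines are whitespace-separated with the project code (e.g. "en" or "en.z") and the page title as the first two fields, and it leaves the packed daily/hourly counts column undecoded (its exact encoding is described on the dumps site), so treat it as a starting point rather than a finished solution:

# my_search.py -- minimal sketch: filter stdin for the English "Dollar"/"Euro"
# rows and collect them into a pandas DataFrame.
import sys
import pandas as pd

WANTED_TITLES = {"Dollar", "Euro"}

rows = []
for line in sys.stdin:                 # one line at a time, nothing buffered
    parts = line.split()
    if len(parts) < 3:
        continue                       # skip malformed lines
    project, title = parts[0], parts[1]
    # you may want to narrow this to the exact project code the English
    # Wikipedia uses in these merged files
    if project.startswith("en") and title in WANTED_TITLES:
        rows.append({"project": project,
                     "title": title,
                     "counts": " ".join(parts[2:])})  # raw counts, still to be decoded

df = pd.DataFrame(rows)
df.to_csv(sys.stdout, index=False)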

This way, there's no need to complicate your Python script itself with any bzip2-specific code.

(This may also be faster than trying to do the bzip2 decoding in Python anyway, since the bzcat process can run in parallel with the search script.)
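(For comparison, if you'd rather keep everything in Python after all, the standard library's bz2 module can also stream-decompress the file line by line without loading it into memory; roughly:

import bz2

# open the compressed dump as a text stream and read it incrementally
with bz2.open("pagecounts-2014-01-views-ge-5-totals.bz2", "rt",
              encoding="utf-8", errors="replace") as f:
    for line in f:
        # same filtering/parsing as in the my_search.py sketch above
        ...

Either way, the filtering logic stays the same; only the source of the lines changes.)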

Upvotes: 2
