Wikimedia pageview compression not working

Question

I am trying to analyze monthly wikimedia pageview statistics. Their daily dumps are OK but monthly reports like the one from June 2021 (https://dumps.wikimedia.org/other/pageview_complete/monthly/2021/2021-06/pageviews-202106-user.bz2) seem broken:

[radim@sandbox2 pageviews]$ bzip2 -t pageviews-202106-user.bz2 
bzip2: pageviews-202106-user.bz2: bad magic number (file not created by bzip2)

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

[radim@sandbox2 pageviews]$ file pageviews-202106-user.bz2 
pageviews-202106-user.bz2: Par archive data

Any idea how to extract the data? What encoding is used here? Can it be Parquet file from their Hive analytics cluster?

Wikimedia pageview compression not working

Answers (1)

Related Questions