wiki dump encoding

Question

I'm using WikiPrep to process the latest Wikipedia dump, enwiki-20121101-pages-articles.xml.bz2. I replaced "use Parse::MediaWikiDump;" with "use MediaWiki::DumpFile::Compat;" and made the corresponding changes in the code. Then I ran

perl wikiprep.pl -f enwiki-20121101-pages-articles.xml.bz2

I got an error

enwiki-20121101-pages-articles.xml.bz2:1: parser error : Document is empty
BZh91AY&SY±H¦ÂOÿ~Ð`ÿÿÿ¿ÿÿÿ¿ÿÿÿÿÿÿÿÿÿÿ½ÿýþdß8õEnÞ¶zëJ¨Eà®mEÓP|f÷Ô
^

I guessed that the dump contained some non-UTF-8 characters, so I ran

iconv -f utf8 -t utf8 enwiki-20121101-pages-articles.xml.bz2

And indeed, I got some errors

BZh91AY&SYiconv: illegal input sequence at position 10

So my question is: what is the encoding of the wiki dump, and if I want to convert it to UTF-8, what should I do? Or how should I modify wikiprep.pl to avoid such problems?

Many thanks

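-- [solved] I needed to decompress the .bz2 file first. Both errors came from feeding the compressed bytes (note the "BZh" bzip2 header at the start of the output) to the XML parser and to iconv, not from the encoding of the dump itself, which is UTF-8 XML.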
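
For anyone hitting the same thing, here is a minimal sketch of the check and the fix in Python, using only the standard library. The helper names (looks_like_bzip2, decompress_to_xml) and the tiny sample file are made up for illustration; they are not part of WikiPrep. It tests for the "BZh" magic bytes, stream-decompresses the file, and then confirms the result decodes as UTF-8.

```python
import bz2
import os
import tempfile

BZIP2_MAGIC = b"BZh"  # every bzip2 stream starts with these three bytes


def looks_like_bzip2(path):
    """Return True if the file starts with the bzip2 magic bytes."""
    with open(path, "rb") as f:
        return f.read(3) == BZIP2_MAGIC


def decompress_to_xml(src, dst, chunk_size=1 << 20):
    """Stream-decompress src (.bz2) into dst without loading it all into memory."""
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)


# Tiny stand-in for the real dump (hypothetical sample content).
sample_xml = "<mediawiki><page><title>Café</title></page></mediawiki>\n"
workdir = tempfile.mkdtemp()
compressed = os.path.join(workdir, "sample.xml.bz2")
plain = os.path.join(workdir, "sample.xml")

with bz2.open(compressed, "wb") as f:
    f.write(sample_xml.encode("utf-8"))

is_compressed = looks_like_bzip2(compressed)
decompress_to_xml(compressed, plain)

with open(plain, "rb") as f:
    decoded = f.read().decode("utf-8")  # raises UnicodeDecodeError if not UTF-8
```

With the real dump you would pass its path as src instead of the sample. On the command line, "bzip2 -dk enwiki-20121101-pages-articles.xml.bz2" does the same decompression and keeps the original archive; after that, wikiprep.pl and iconv can be pointed at the resulting .xml file.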
