xuan

Reputation: 300

wiki dump encoding

I'm using WikiPrep to process the latest wiki dump, enwiki-20121101-pages-articles.xml.bz2. I replaced "use Parse::MediaWikiDump;" with "use MediaWiki::DumpFile::Compat;" and made the corresponding changes in the code. Then I ran

perl wikiprep.pl -f enwiki-20121101-pages-articles.xml.bz2

I got an error

enwiki-20121101-pages-articles.xml.bz2:1: parser error : Document is empty
BZh91AY&SY±H¦ÂOÿ~Ð`ÿÿÿ¿ÿÿÿ¿ÿÿÿÿÿÿÿÿÿÿ½ÿýþdß8õEnÞ¶zëJ¨Eà®mEÓP|f÷Ô
^

I guessed that the dump contained some non-UTF-8 characters, so I ran

iconv -f utf8 -t utf8 enwiki-20121101-pages-articles.xml.bz2

And indeed, I got some errors

BZh91AY&SYiconv: illegal input sequence at position 10

So my question is: what is the encoding of the wiki dump, and if I want to convert it to UTF-8, what should I do? Or how should I modify wikiprep.pl to avoid such problems?

Many thanks

-- [solved] I should have decompressed the file first.

Upvotes: 2

Views: 1218

Answers (1)

Nemo

Reputation: 2544

You are running iconv on the compressed (bz2) version of the file, rather than the XML file itself. Uncompress it first.
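The "BZh9" at the start of both error messages is the bzip2 file signature, which shows that the tools were reading compressed bytes rather than XML. As a rough sketch (the function name decompress_and_check is made up for illustration, and I am assuming bzip2 and iconv are installed), you could decompress the dump and run the same iconv check on the resulting XML like this:

```shell
# Hypothetical helper: decompress a .bz2 archive next to itself and
# verify that the decompressed file is valid UTF-8.
decompress_and_check() {
    archive=$1
    xml=${archive%.bz2}    # strip the .bz2 suffix to get the output name

    # -d decompresses, -c writes to stdout; the original archive is kept.
    bzip2 -dc "$archive" > "$xml" || return 1

    # Same check as in the question, but on the XML rather than the archive.
    # iconv exits non-zero if it hits an invalid UTF-8 sequence.
    iconv -f UTF-8 -t UTF-8 "$xml" > /dev/null || return 1

    # Print the name of the decompressed file on success.
    echo "$xml"
}

# Example call (assumes the dump is in the current directory):
# decompress_and_check enwiki-20121101-pages-articles.xml.bz2
```

If that succeeds, passing the decompressed .xml file to wikiprep.pl with -f instead of the .bz2 file should avoid the "Document is empty" error. Note that the decompressed dump is much larger than the archive, so check disk space first.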

(Posting borrible's answer so that this resolved question is not listed as unanswered.)

Upvotes: 1
