Reputation: 300
I'm using WikiPrep to process the latest wiki dump enwiki-20121101-pages-articles.xml.bz2. Instead of "use Parse::MediaWikiDump;" I replaced that by "use MediaWiki::DumpFile::Compat;" and did the proper changes in the code. Then, I ran
perl wikiprep.pl -f enwiki-20121101-pages-articles.xml.bz2
I got an error
enwiki-20121101-pages-articles.xml.bz2:1: parser error : Document is empty
BZh91AY&SY±H¦ÂOÿ~Ð`ÿÿÿ¿ÿÿÿ¿ÿÿÿÿÿÿÿÿÿÿ½ÿýþdß8õEnÞ¶zëJ¨Eà®mEÓP|f÷Ô
^
I guess there are some non-utf8 characters contained in the dump. So I ran
iconv -f utf8 -t utf8 enwiki-20121101-pages-articles.xml.bz2
And indeed, I got some errors
BZh91AY&SYiconv: illegal input sequence at position 10
So, my question is what's the encoding format of wiki dump and if I wish to convert it to utf-8, what shall I do? Or how should modify wikiprep.pl to avoid such problems.
Many thanks
-- [solved] I should first unzip the file first.
Upvotes: 2
Views: 1218
Reputation: 2544
You are running iconv on the compressed (bz2) version of the file, rather than the XML file itself. Uncompress it first.
(Posting borrible's answer so that this resolved question is not listed as unanswered.)
Upvotes: 1