I want to do some text analysis, so I have downloaded a dump of Wikipedia articles. The dump is one huge XML file with the wikitext of each article inside an XML tag. After filtering it with Expat I still see some XML-like markup in the text, for example:
<ref name="Princeton">Buswell & Lopez (2014) uppslagsord: sang rgyas.</ref>
This appears together with ordinary wiki markup. I guess the next step would be to pass each article through a wiki parser. I would like the application-level API to look like this:
std::string get_body_text(std::string_view wikitext);
so that I can print the filtered text from inside the Expat callback. Should I pipe the wikitext to pandoc, or should I try to find a C++ parser for the MediaWiki format?
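For the pandoc route, something like the rough sketch below is what I have in mind (assuming pandoc is on PATH, converting one article at a time; it writes the article to a temporary file because popen only gives one pipe direction):

#include <cstdio>    // popen, pclose, fread, std::remove
#include <cstdlib>   // mkstemp (POSIX)
#include <unistd.h>  // close
#include <fstream>
#include <string>
#include <string_view>

std::string get_body_text(std::string_view wikitext)
{
    // popen() is one-directional, so write the article to a temporary
    // file and let pandoc read it from there.
    char tmpname[] = "/tmp/wikitext_XXXXXX";
    int fd = mkstemp(tmpname);
    if (fd == -1)
        return {};
    close(fd);
    {
        std::ofstream tmp(tmpname);
        tmp.write(wikitext.data(), wikitext.size());
    }

    std::string cmd = std::string("pandoc --from mediawiki --to plain ") + tmpname;

    // Read pandoc's plain-text output back through the pipe.
    std::string body;
    if (FILE* out = popen(cmd.c_str(), "r")) {
        char buf[4096];
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, out)) > 0)
            body.append(buf, n);
        pclose(out);
    }
    std::remove(tmpname);
    return body;
}

Spawning one pandoc process per article is presumably slow, so this would probably need to batch articles somehow.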
I tried
cat ~/Skrivbord/svwiki-latest-pages-articles.xml | __targets/wikifilter | pandoc --from MediaWiki
But my machine does not have enough RAM for that to work. I guess pandoc builds a DOM-like representation rather than working SAX-style, or maybe Haskell is not good at conserving memory.
Update: I can get reasonable performance if I push chunks of multiple articles (not all at once) through pandoc. Now the problem is that I have to get rid of all template references. For my use case, it is probably best to have templates replaced by the empty string; see the sketch below.
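Here is a rough sketch of what I mean by replacing templates with the empty string (the name strip_templates is just mine). Since templates can nest, it counts {{ / }} depth instead of using a regex:

#include <string>
#include <string_view>

std::string strip_templates(std::string_view wikitext)
{
    std::string out;
    out.reserve(wikitext.size());
    int depth = 0;
    for (size_t i = 0; i < wikitext.size(); ++i) {
        if (i + 1 < wikitext.size() && wikitext[i] == '{' && wikitext[i + 1] == '{') {
            ++depth;            // entering a (possibly nested) template
            ++i;                // skip the second '{'
        } else if (depth > 0 && i + 1 < wikitext.size()
                   && wikitext[i] == '}' && wikitext[i + 1] == '}') {
            --depth;            // leaving a template
            ++i;                // skip the second '}'
        } else if (depth == 0) {
            out += wikitext[i]; // keep only text outside templates
        }
    }
    return out;
}

The idea would be to run each chunk through this before piping it to pandoc.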