user877329

Reputation: 6220

Get body text from Wikipedia dump

I want to do some text analysis, so I have downloaded a dump of Wikipedia articles. The file is a huge XML file with the wikitext of each article inside an XML element. After filtering with expat I still see some XML-like markup, for example:

<ref name="Princeton">Buswell & Lopez (2014) uppslagsord: sang rgyas.</ref>

This is mixed with ordinary wiki markup. I guess the next step would be to pass each article through a wiki parser. I would like the application-level API to look like

std::string get_body_text(std::string_view wikitext);

so that I can print the filtered text from the expat callback. Should I pipe the wikitext to pandoc, or try to find a C++ parser for the MediaWiki format?
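For context, the expat side of my filter looks roughly like the sketch below. It is simplified: it only watches <text> elements, ignores page namespaces, redirects and error handling, and get_body_text is exactly the part I am asking about.

#include <expat.h>
#include <cstdio>
#include <cstring>
#include <string>
#include <string_view>

std::string get_body_text(std::string_view wikitext); // the missing piece

struct ParseState
{
    bool in_text = false;  // are we inside a <text> element?
    std::string wikitext;  // accumulated wikitext of the current article
};

static void on_start(void* user, const XML_Char* name, const XML_Char**)
{
    auto& state = *static_cast<ParseState*>(user);
    if (std::strcmp(name, "text") == 0)
    {
        state.in_text = true;
        state.wikitext.clear();
    }
}

static void on_end(void* user, const XML_Char* name)
{
    auto& state = *static_cast<ParseState*>(user);
    if (std::strcmp(name, "text") == 0)
    {
        state.in_text = false;
        std::fputs(get_body_text(state.wikitext).c_str(), stdout);
    }
}

static void on_chardata(void* user, const XML_Char* s, int len)
{
    auto& state = *static_cast<ParseState*>(user);
    if (state.in_text)
    { state.wikitext.append(s, static_cast<size_t>(len)); }
}

int main()
{
    ParseState state;
    XML_Parser parser = XML_ParserCreate(nullptr);
    XML_SetUserData(parser, &state);
    XML_SetElementHandler(parser, on_start, on_end);
    XML_SetCharacterDataHandler(parser, on_chardata);

    char buffer[1 << 16];
    size_t n;
    while ((n = std::fread(buffer, 1, sizeof buffer, stdin)) > 0)
    { XML_Parse(parser, buffer, static_cast<int>(n), 0); }
    XML_Parse(parser, nullptr, 0, 1);
    XML_ParserFree(parser);
    return 0;
}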

I tried

cat ~/Skrivbord/svwiki-latest-pages-articles.xml | __targets/wikifilter | pandoc --from MediaWiki

But my machine does not have enough RAM for that to work. I guess pandoc is DOM-like rather than SAX-like, or maybe Haskell is not good at conserving memory.
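One workaround I am considering is to invoke a separate pandoc process per article (or per small batch) from the filter itself, instead of piping the whole dump through a single pandoc run, so that pandoc only ever holds one chunk in memory. A rough sketch, assuming a POSIX system with pandoc on PATH (the real code would need proper error handling):

#include <stdio.h>
#include <stdlib.h>
#include <array>
#include <string>
#include <string_view>

std::string get_body_text(std::string_view wikitext)
{
    // Write the wikitext of one article to a temporary file, let pandoc
    // convert it to plain text, and read pandoc's output back.
    char tmpname[] = "/tmp/wikitextXXXXXX";
    int fd = mkstemp(tmpname);
    if (fd == -1) { return {}; }
    FILE* tmp = fdopen(fd, "w");
    fwrite(wikitext.data(), 1, wikitext.size(), tmp);
    fclose(tmp);

    std::string cmd = std::string{"pandoc --from mediawiki --to plain "} + tmpname;
    FILE* out = popen(cmd.c_str(), "r");
    std::string result;
    if (out != nullptr)
    {
        std::array<char, 4096> buffer;
        size_t n;
        while ((n = fread(buffer.data(), 1, buffer.size(), out)) > 0)
        { result.append(buffer.data(), n); }
        pclose(out);
    }
    remove(tmpname);
    return result;
}

Starting a new process per article adds noticeable startup overhead, so batching several articles per pandoc call is probably better.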

Update: I can get reasonable performance if I push chunks of multiple articles (not all at once) through pandoc. Now the problem is that I have to get rid of all template references. For my use case, it is probably best to have templates replaced by an empty string.
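The simplest thing I can think of is to strip the template transclusions myself before handing the text to pandoc, i.e. drop everything between {{ and the matching }}, nested templates included. A naive sketch that just counts brace depth (it makes no attempt to handle templates inside <nowiki> sections or comments):

#include <string>
#include <string_view>

std::string strip_templates(std::string_view wikitext)
{
    std::string result;
    result.reserve(wikitext.size());
    int depth = 0;  // current {{ ... }} nesting level
    for (size_t i = 0; i < wikitext.size(); ++i)
    {
        bool has_next = i + 1 < wikitext.size();
        if (has_next && wikitext[i] == '{' && wikitext[i + 1] == '{')
        { ++depth; ++i; continue; }
        if (has_next && wikitext[i] == '}' && wikitext[i + 1] == '}' && depth > 0)
        { --depth; ++i; continue; }
        if (depth == 0)
        { result.push_back(wikitext[i]); }
    }
    return result;
}

The filter would then call something like get_body_text(strip_templates(wikitext)) for each article.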

Upvotes: 0

Views: 363

Answers (0)
