user877329

Reputation: 6220

Get body text from Wikipedia dump

I want to do some text analysis, so I have downloaded a dump of Wikipedia articles. The file is a huge XML file with the wikitext of each article inside an XML element. After filtering with expat I still see some XML-like markup, for example:

<ref name="Princeton">Buswell & Lopez (2014) uppslagsord: sang rgyas.</ref>

This is mixed with ordinary wiki markup. I guess the next step would be to pass each article through a wiki parser. I would like the application-level API to look like

std::string get_body_text(std::string_view wikitext);

so that I can print the filtered text from the expat callback. Should I pipe the wikitext to pandoc, or try to find a C++ parser for the MediaWiki format?
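For context, the expat side of my filter looks roughly like the sketch below. It is simplified: it only watches <text> elements, ignores page namespaces, redirects and error handling, and get_body_text is exactly the part I am asking about.

#include <expat.h>
#include <cstdio>
#include <cstring>
#include <string>
#include <string_view>

std::string get_body_text(std::string_view wikitext); // the missing piece

struct ParseState
{
    bool in_text = false;  // are we inside a <text> element?
    std::string wikitext;  // accumulated wikitext of the current article
};

static void on_start(void* user, const XML_Char* name, const XML_Char**)
{
    auto& state = *static_cast<ParseState*>(user);
    if (std::strcmp(name, "text") == 0)
    {
        state.in_text = true;
        state.wikitext.clear();
    }
}

static void on_end(void* user, const XML_Char* name)
{
    auto& state = *static_cast<ParseState*>(user);
    if (std::strcmp(name, "text") == 0)
    {
        state.in_text = false;
        std::fputs(get_body_text(state.wikitext).c_str(), stdout);
    }
}

static void on_chardata(void* user, const XML_Char* s, int len)
{
    auto& state = *static_cast<ParseState*>(user);
    if (state.in_text)
    { state.wikitext.append(s, static_cast<size_t>(len)); }
}

int main()
{
    ParseState state;
    XML_Parser parser = XML_ParserCreate(nullptr);
    XML_SetUserData(parser, &state);
    XML_SetElementHandler(parser, on_start, on_end);
    XML_SetCharacterDataHandler(parser, on_chardata);

    char buffer[1 << 16];
    size_t n;
    while ((n = std::fread(buffer, 1, sizeof buffer, stdin)) > 0)
    { XML_Parse(parser, buffer, static_cast<int>(n), 0); }
    XML_Parse(parser, nullptr, 0, 1);
    XML_ParserFree(parser);
    return 0;
}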

I tried

cat ~/Skrivbord/svwiki-latest-pages-articles.xml | __targets/wikifilter | pandoc --from MediaWiki

But my machine does not have enough RAM for that to work. I guess pandoc is DOM-like rather than SAX-like, or maybe Haskell is not good at conserving memory.
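One workaround I am considering is to invoke a separate pandoc process per article (or per small batch) from the filter itself, instead of piping the whole dump through a single pandoc run, so that pandoc only ever holds one chunk in memory. A rough sketch, assuming a POSIX system with pandoc on PATH (the real code would need proper error handling):

#include <stdio.h>
#include <stdlib.h>
#include <array>
#include <string>
#include <string_view>

std::string get_body_text(std::string_view wikitext)
{
    // Write the wikitext of one article to a temporary file, let pandoc
    // convert it to plain text, and read pandoc's output back.
    char tmpname[] = "/tmp/wikitextXXXXXX";
    int fd = mkstemp(tmpname);
    if (fd == -1) { return {}; }
    FILE* tmp = fdopen(fd, "w");
    fwrite(wikitext.data(), 1, wikitext.size(), tmp);
    fclose(tmp);

    std::string cmd = std::string{"pandoc --from mediawiki --to plain "} + tmpname;
    FILE* out = popen(cmd.c_str(), "r");
    std::string result;
    if (out != nullptr)
    {
        std::array<char, 4096> buffer;
        size_t n;
        while ((n = fread(buffer.data(), 1, buffer.size(), out)) > 0)
        { result.append(buffer.data(), n); }
        pclose(out);
    }
    remove(tmpname);
    return result;
}

Starting a new process per article adds noticeable startup overhead, so batching several articles per pandoc call is probably better.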

Update: I can get reasonable performance if I push chunks of multiple articles (not all at once) through pandoc. Now the problem is that I have to get rid of all template references. For my use case, it is probably best to have templates replaced by an empty string.
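The simplest thing I can think of is to strip the template transclusions myself before handing the text to pandoc, i.e. drop everything between {{ and the matching }}, nested templates included. A naive sketch that just counts brace depth (it makes no attempt to handle templates inside <nowiki> sections or comments):

#include <string>
#include <string_view>

std::string strip_templates(std::string_view wikitext)
{
    std::string result;
    result.reserve(wikitext.size());
    int depth = 0;  // current {{ ... }} nesting level
    for (size_t i = 0; i < wikitext.size(); ++i)
    {
        bool has_next = i + 1 < wikitext.size();
        if (has_next && wikitext[i] == '{' && wikitext[i + 1] == '{')
        { ++depth; ++i; continue; }
        if (has_next && wikitext[i] == '}' && wikitext[i + 1] == '}' && depth > 0)
        { --depth; ++i; continue; }
        if (depth == 0)
        { result.push_back(wikitext[i]); }
    }
    return result;
}

The filter would then call something like get_body_text(strip_templates(wikitext)) for each article.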

Upvotes: 0

Views: 363

Answers (0)
