Yamen Ajjour

Reputation: 1432

Parse a Wikipedia dump into plain text while preserving the structure (sections)

I have been searching for a Wikipedia dump parser that produces customizable XML. Basically, each article should be parsed into a set of section tags, each containing the plain text of that section. I came up with the following solutions:

The problem with the first one is that it is available only on Windows, and the second cannot produce the sections in a nested XML scheme. Earlier versions of mwlib seem to have provided this capability, but sadly newer versions do not. Is there a Wikipedia XML dump parser for Linux that can produce customizable XML?

Upvotes: 0

Views: 1165

Answers (1)

David Przybilla

Reputation: 828

I think this is doable using jsonwikipedia [1], which generates a "JSON dump" out of the Wikipedia XML dump. More details on jsonwikipedia and other tools are in this blog post [2].

[1] - https://github.com/idio/json-wikipedia

[2] - http://engineering.idioplatform.com/2016/02/18/wikipedia-toolkit.html
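
If none of the existing tools fits, the nesting itself is not hard to build by hand. Below is a minimal sketch in plain Python (standard library only), not what jsonwikipedia does internally. It assumes you have already extracted the raw wikitext of one article from the dump, and it relies only on the MediaWiki heading syntax (`== Title ==`, `=== Title ===`, and so on). The function name `wikitext_to_xml` and the `article`/`section`/`text` tag names are made up for illustration, so rename them to match your own scheme. Note that it does not strip wiki markup (templates, links) from the section text; that would need a separate cleaning step.

```python
import re
import xml.etree.ElementTree as ET

# MediaWiki heading line: == Title ==, === Title ===, ... (2 to 6 equals signs).
HEADING = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$")


def wikitext_to_xml(title, wikitext):
    """Build a nested <section> tree from the heading lines of one article.

    Hypothetical helper: the tag and attribute names are illustrative only.
    """
    root = ET.Element("article", {"title": title})
    # Stack of (heading level, element); the article itself acts as level 1.
    stack = [(1, root)]
    buffer = []

    def flush():
        # Attach the text collected so far to the innermost open section.
        text = "\n".join(buffer).strip()
        if text:
            ET.SubElement(stack[-1][1], "text").text = text
        buffer.clear()

    for line in wikitext.splitlines():
        match = HEADING.match(line)
        if match is None:
            buffer.append(line)
            continue
        flush()
        level = len(match.group(1))
        # Close every open section at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        section = ET.SubElement(
            stack[-1][1], "section",
            {"title": match.group(2), "level": str(level)},
        )
        stack.append((level, section))
    flush()
    return root


# Usage with a tiny made-up article body.
sample = (
    "Intro text.\n"
    "== History ==\n"
    "Founded long ago.\n"
    "=== Early years ===\n"
    "Details.\n"
    "== Geography ==\n"
    "Flat."
)
root = wikitext_to_xml("Example", sample)
print(ET.tostring(root, encoding="unicode"))
```

The stack keeps track of which sections are still open, so a `===` heading becomes a child of the preceding `==` section, and the next `==` heading closes both. Text that appears before the first heading is attached directly to the article element.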

Upvotes: 0
