Reputation: 1432
I have been searching for a Wikipedia dump parser that outputs customizable XML. Basically, each article should be parsed into a set of section tags, each containing the plain text of that section of the article. I came up with the following solutions:
The problem with the first one is that it is available only on Windows, and the second cannot produce the sections in a nested XML scheme. Previous versions of mwlib seem to have provided such capabilities, but sadly the newer versions do not. Is there any Wikipedia XML dump parser for Linux that can produce customizable XML?
Upvotes: 0
Views: 1165
Reputation: 828
I think this is doable using jsonwikipedia [1], which generates a "JSON dump" out of the Wikipedia XML dump. There are more details on jsonwikipedia and other tools in this blog post [2].
[1] - https://github.com/idio/json-wikipedia
[2] - http://engineering.idioplatform.com/2016/02/18/wikipedia-toolkit.html
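If you end up building the nested sections yourself once you have each article's raw wikitext, something along these lines might be a starting point. It is only a rough sketch using the Python standard library, not part of jsonwikipedia: the function name `wikitext_to_xml` and the `article`/`section` tag names are made up for illustration, and it assumes sections are marked with ordinary `== Heading ==` lines. It does not strip links, templates, or other markup, so the section text is not yet plain text.

```python
import re
import xml.etree.ElementTree as ET

# Matches MediaWiki heading lines such as "== History ==" or "=== Early years ===".
HEADING = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$")


def wikitext_to_xml(title, wikitext):
    """Return an <article> element whose nested <section> children follow heading depth.

    Hypothetical helper: tag and attribute names are illustrative assumptions.
    """
    root = ET.Element("article", {"title": title})
    # Stack of (heading_level, element); the article root counts as level 1.
    stack = [(1, root)]
    buffer = []

    def flush():
        # Store the lines collected so far as the text of the innermost open element.
        text = "\n".join(buffer).strip()
        if text:
            stack[-1][1].text = text
        buffer.clear()

    for line in wikitext.splitlines():
        match = HEADING.match(line)
        if match is None:
            buffer.append(line)
            continue
        flush()
        level = len(match.group(1))
        # Close every open section at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        section = ET.SubElement(stack[-1][1], "section", {"title": match.group(2)})
        stack.append((level, section))

    flush()
    return root
```

Calling `ET.tostring(wikitext_to_xml(title, text), encoding="unicode")` serializes the result, and renaming the tags or attributes is a one-line change if you need a different schema.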
Upvotes: 0