benui

Reputation: 6818

Using a modified Nokogiri to parse Wikitext?

Apologies for the length of this question; it's more of an "is this possible" than a "how do I do this".

My objective is to remove everything but plain text from Wikipedia markup -- tables, templates, formatting -- whether it is written in wikitext (e.g. ''bold text'') or HTML (<b>bold text</b>).

Wikipedia text is a mix of custom tags -- templates {{ ... }}, tables {| ... |}, links [[ ... ]] -- and HTML elements. Parsing it is kind of a nightmare. You can't use regular expressions alone because the tags can be nested, and since the text can contain HTML almost anything is possible. Some of the text within the HTML I'd want to keep (the content of bold tags, say), but other things like tables would need to be stripped entirely.
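To show what I mean about nesting: a regex like /\{\{.*?\}\}/ stops at the first }}, so a {{cite}} inside an {{infobox}} leaves the outer template mangled. Something depth-tracking is needed. Here's a throwaway sketch of the idea (strip_templates is my own illustration, not from any library, and it only handles {{ }} templates, not tables or links):

    def strip_templates(text)
      out   = ''
      depth = 0
      i     = 0
      while i < text.length
        if text[i, 2] == '{{'              # entering a (possibly nested) template
          depth += 1
          i += 2
        elsif text[i, 2] == '}}' && depth > 0
          depth -= 1
          i += 2
        else
          out << text[i] if depth.zero?    # keep text only outside templates
          i += 1
        end
      end
      out
    end

    strip_templates("Born {{circa|{{year|1900}}}} in Paris")
    # => "Born  in Paris"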

I thought about repurposing an XML parser like Nokogiri, adding {{/}} as alternatives to <x>/</x>.

Does anyone who knows Nokogiri (or another Ruby XML parser) know if this is possible or even a good idea?

My alternative is to use an existing parser like WikiCloth for the wiki markup, and then strip any leftover HTML in a second pass.
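Something like this two-pass pipeline is what I have in mind (an untested sketch: WikiCloth's documented entry point is Parser.new(:data => ...).to_html, and article.wiki is just a placeholder filename):

    require 'wikicloth'
    require 'nokogiri'

    wikitext = File.read('article.wiki')    # placeholder input file

    # Pass 1: let WikiCloth expand the wiki markup into HTML
    html = WikiCloth::Parser.new(:data => wikitext).to_html

    # Pass 2: use Nokogiri to discard unwanted elements and keep plain text
    doc = Nokogiri::HTML(html)
    doc.css('table').each(&:remove)         # tables go entirely
    plain_text = doc.text                   # text content of everything left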

Upvotes: 1

Views: 475

Answers (1)

Phrogz

Reputation: 303381

This sounds like a good idea. However, it would not be possible for you to 'patch' Nokogiri, "adding {{/}} as alternatives to <x>/</x>". That's because the bulk of the work Nokogiri does (parsing, XPath evaluation, and generating the string representation of a DOM) actually happens in libxml2 in the back end. You'd have to patch and recompile libxml2 (and then rebuild Nokogiri against your new version)... but at that point I have no idea how Nokogiri would behave.

You might have better luck trying to patch REXML, since that is written in pure Ruby.

Upvotes: 1
