Reputation: 4196
I'm trying to parse specific Wikipedia content in a structured way. Here's an example page:
http://en.wikipedia.org/wiki/Polar_bear
I'm having some success. I can detect that this page is a "species" page, and I can also parse the Taxobox (on the right) information into a structure. So far so good.
However, I'm also trying to parse the text paragraphs. The API returns these in either Wiki format or HTML format; I'm currently working with the Wiki format.
I can read these paragraphs, but I'd like to "clean" them in a specific way, because ultimately I have to display them in my app, which has no notion of Wiki markup. For example, I'd like to remove all images. That's fairly easy by filtering out [[Image:]] blocks. Yet there are also blocks that I simply cannot remove, such as:
{{convert|350|-|680|kg|abbr=on}}
Removing this entire block would break the sentence, and there are dozens of notations like this that carry special meaning. I'd like to avoid writing a hundred regular expressions to process all of this and would rather parse it in a smarter way.
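To illustrate the kind of filtering I mean, here is a minimal sketch of image-block removal (Python; the nested-bracket handling is naive, and it does nothing for the {{convert}} problem):

```python
def strip_image_blocks(wikitext):
    """Remove [[Image:...]] / [[File:...]] blocks, including captions
    that themselves contain nested [[...]] links."""
    out = []
    i = 0
    while i < len(wikitext):
        if wikitext.startswith(("[[Image:", "[[File:"), i):
            depth = 0
            while i < len(wikitext):
                if wikitext.startswith("[[", i):
                    depth += 1
                    i += 2
                elif wikitext.startswith("]]", i):
                    depth -= 1
                    i += 2
                    if depth == 0:
                        break
                else:
                    i += 1
        else:
            out.append(wikitext[i])
            i += 1
    return "".join(out)

# Templates like {{convert|350|-|680|kg|abbr=on}} cannot be handled this way:
# deleting them also deletes the "350-680 kg" that the sentence needs.
```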
My dilemma is as follows:
Ideally, there'd be a library to solve this problem, but I haven't found one yet that is up to the job. I also had a look at structured Wikipedia databases like DBpedia, but those only contain the same structured data I already have; they don't provide any structure within the Wiki text itself.
Upvotes: 4
Views: 5230
Reputation: 244777
There are too many templates in use to reimplement all of them by hand, and they change all the time. So you will need an actual parser of the wiki syntax that can process all the templates.
And the wiki syntax is quite complex, has lots of quirks and has no formal specification. This means creating your own parser would be too much work; you should use the one in MediaWiki.
Because of this, I think getting the parsed HTML through the MediaWiki API is your best bet.
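For example, something along these lines should fetch the already-rendered HTML for the polar bear article (a minimal sketch in Python using the requests library; the parse API expands every template, including {{convert}}, before returning the HTML):

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

# action=parse returns the HTML that MediaWiki itself renders,
# with all templates already expanded.
params = {
    "action": "parse",
    "page": "Polar bear",
    "prop": "text",
    "format": "json",
    "formatversion": 2,
}

response = requests.get(API_URL, params=params)
response.raise_for_status()
html = response.json()["parse"]["text"]

# 'html' now holds the article body as HTML, ready to be cleaned with
# an HTML parser instead of dozens of wikitext regexes.
print(html[:300])
```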
One thing that's probably easier to parse from the wiki markup is the infoboxes, so maybe they should be treated as a special case.
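If you do keep the infobox on the wikitext side, a rough brace-matching sketch like this can pull out the raw {{Taxobox ...}} block (simplified; real pages have edge cases this ignores, and `page_wikitext` stands for whatever wikitext you fetched):

```python
def extract_template(wikitext, name):
    """Return the raw text of the first {{name ...}} template, or None.

    Counts {{ / }} pairs so that templates nested inside the infobox
    (e.g. {{convert|...}}) do not end the match early.
    """
    start = wikitext.find("{{" + name)
    if start == -1:
        return None
    depth, i = 0, start
    while i < len(wikitext) - 1:
        if wikitext[i:i + 2] == "{{":
            depth += 1
            i += 2
        elif wikitext[i:i + 2] == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return wikitext[start:i]
        else:
            i += 1
    return None

# taxobox = extract_template(page_wikitext, "Taxobox")
```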
Upvotes: 3