Parse Wikipedia lists and descriptions using regular expressions

Question

Not being terribly familiar with regular expressions, I need to find a way to parse lists of items from Wikipedia. I've pulled the content using Wikipedia's api.php and I am left with data that looks like this:

    ==Formal fallacies==
    A [[formal fallacy]] is an error in logic that...

    * [[Appeal to probability]] –  takes something for granted because...
    * [[Argument from fallacy]] –  assumes that if an argument ...
    * [[Base rate fallacy]] –  making a probability judgement...
    * [[Conjunction fallacy]] –  assumption that an outcome simultaneously...
    * [[Masked man fallacy]] –  ...

    ===Propositional fallacies===

    * [[Affirming a disjunct]] –  concluded that ...
    * [[Affirming the consequent]] –  the [[antecedent...
    * [[Denying the antecedent]] –  the [[consequent]] in...

So, I need a way to pull the data so that:

We are only paying attention to lines starting with * [[
Anything between [[ and ]] is the name
The remaining content after the dash (–) is the description

Casimir et Hippolyte · Accepted Answer

This does the job:

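    // Named groups "name" and "description" capture the link text and the text after the dash.
    preg_match_all('~^\h*+\*\h*\[\[(?<name>[a-z ]++)]]\h*+[-–]\h*+(?<description>.++)$~imu', $text, $results, PREG_SET_ORDER);
    // Drop the numeric keys so that only the named groups remain.
    foreach ($results as &$result) {
        foreach ($result as $key => $value) {
            if (is_numeric($key)) unset($result[$key]);
        }
    }
    unset($result); // break the reference left over from the foreach
    echo '<pre>' . print_r($results, true) . '</pre>';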
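
If a flat name => description map is more convenient than the nested result, the same pattern can feed one directly. This is only a minimal sketch: the sample text is copied from the question, the group names `name` and `description` are assumed, and `$fallacies` is just an illustrative variable name. Note that `[a-z ]` accepts letters and spaces only, so link names containing digits, hyphens or a pipe (`[[Target|label]]`) would be skipped unless the character class is widened.

```php
<?php
// Sample wikitext copied from the question (the descriptions are truncated there too).
$text = <<<'WIKI'
==Formal fallacies==
A [[formal fallacy]] is an error in logic that...

* [[Appeal to probability]] –  takes something for granted because...
* [[Argument from fallacy]] –  assumes that if an argument ...

===Propositional fallacies===

* [[Affirming a disjunct]] –  concluded that ...
WIKI;

// Same pattern as in the answer; the group names are an assumption.
$pattern = '~^\h*+\*\h*\[\[(?<name>[a-z ]++)]]\h*+[-–]\h*+(?<description>.++)$~imu';

preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

// Build a name => description map from the named groups only.
$fallacies = [];
foreach ($matches as $match) {
    // rtrim() removes a trailing "\r" or spaces if the input uses Windows line endings.
    $fallacies[$match['name']] = rtrim($match['description']);
}

print_r($fallacies);
```

Headings and the introductory sentence are ignored because they do not start with `*`, so only the three list items end up in the map.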
