kilrizzy

Reputation: 2943

Parse Wikipedia lists and descriptions using regular expressions

I'm not terribly familiar with regular expressions, and I need a way to parse lists of items from Wikipedia. I've pulled the content using Wikipedia's api.php, and I'm left with data that looks like this:

    ==Formal fallacies==
    A [[formal fallacy]] is an error in logic that...

    * [[Appeal to probability]] –  takes something for granted because...
    * [[Argument from fallacy]] –  assumes that if an argument ...
    * [[Base rate fallacy]] –  making a probability judgement...
    * [[Conjunction fallacy]] –  assumption that an outcome simultaneously...
    * [[Masked man fallacy]] –  ...

    ===Propositional fallacies===

    * [[Affirming a disjunct]] –  concluded that ...
    * [[Affirming the consequent]] –  the [[antecedent...
    * [[Denying the antecedent]] –  the [[consequent]] in...

So, I need a way to pull the data so that:

Upvotes: 0

Views: 135

Answers (2)

Casimir et Hippolyte

Reputation: 89565

This does the job:

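    // Match bullet lines of the form "* [[Name]] – description" (one match set per line).
    preg_match_all('~^\h*+\*\h*\[\[(?<name>[a-z ]++)]]\h*+[-–]\h*+(?<description>.++)$~imu', $text, $results, PREG_SET_ORDER);
    // Drop the numeric keys so that only the named groups remain.
    foreach ($results as &$result) {
        foreach ($result as $key => $value) {
            if (is_numeric($key)) unset($result[$key]);
        }
    }
    unset($result); // break the reference left over from the by-reference loop
    echo '<pre>' . print_r($results, true) . '</pre>';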
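
To illustrate what the pattern captures, here is a small self-contained sketch run against a few lines modeled on the question's sample. The $text value and the parseFallacies() helper name are assumptions made for the example, and array_filter() with ARRAY_FILTER_USE_KEY is swapped in for the nested loop as another way to drop the numeric keys.

```php
<?php
// Sample wikitext modeled on the question's data (assumed; not fetched from the API).
$text = <<<'WIKI'
==Formal fallacies==
A [[formal fallacy]] is an error in logic that...

* [[Appeal to probability]] –  takes something for granted because...
* [[Argument from fallacy]] –  assumes that if an argument ...

===Propositional fallacies===

* [[Denying the antecedent]] –  the [[consequent]] in...
WIKI;

// Hypothetical helper wrapping the pattern from the answer above.
function parseFallacies(string $text): array
{
    $pattern = '~^\h*+\*\h*\[\[(?<name>[a-z ]++)]]\h*+[-–]\h*+(?<description>.++)$~imu';
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    // Keep only the named groups by filtering out the integer keys.
    return array_map(
        fn(array $m): array => array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY),
        $matches
    );
}

$items = parseFallacies($text);
foreach ($items as $item) {
    echo $item['name'], ' => ', $item['description'], PHP_EOL;
}
```

Headings and the introductory sentence are skipped because they don't start with an asterisk, and nested links such as [[consequent]] stay untouched inside the description. Note that the name group only accepts letters and spaces, so a title containing a hyphen, digit or apostrophe would not match as written.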

Upvotes: 1

Rok Burgar

Reputation: 949

First, replace

    ^((?!\*\s\[\[).)*$

with an empty string. In multiline mode, this blanks every line that doesn't contain "* [[".

Then delete the leftover empty lines by replacing

    ^\n|\r$

with an empty string.

Here is a regex to get the title and description, followed by its replacement string:

    ^\s+\*\s\[\[([^\]\]]*)\]\]\s–(.*)
    Title: "$1", Description: "$2"
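
The three steps above could be chained with preg_replace() roughly as follows. This is only a sketch: the sample $text is assumed, the "m" flag is added so that ^ and $ work per line, and the leading \s+ of the last pattern is relaxed to \s* (my assumption), because the original form needs at least one whitespace character before the asterisk and would skip unindented lines.

```php
<?php
// Sample wikitext modeled on the question's data (assumed; not fetched from the API).
$text = <<<'WIKI'
==Formal fallacies==
A [[formal fallacy]] is an error in logic that...

* [[Appeal to probability]] –  takes something for granted because...
* [[Base rate fallacy]] –  making a probability judgement...

===Propositional fallacies===

* [[Denying the antecedent]] –  the [[consequent]] in...
WIKI;

// Step 1: blank out every line that does not contain "* [[".
$step1 = preg_replace('~^((?!\*\s\[\[).)*$~m', '', $text);

// Step 2: remove the now-empty lines and any stray carriage returns.
$step2 = preg_replace('~^\n|\r$~m', '', $step1);

// Step 3: rewrite each remaining line as a title/description pair.
// Assumption: \s+ from the answer is relaxed to \s* so unindented lines match too.
$output = preg_replace(
    '~^\s*\*\s\[\[([^\]\]]*)\]\]\s–(.*)~mu',
    'Title: "$1", Description: "$2"',
    $step2
);

echo $output, PHP_EOL;
```

Because the last pattern consumes only one whitespace character before the dash and none after it, any extra spaces following the dash end up at the start of the captured description; wrapping the result in trim() is one way to tidy that up.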

Upvotes: 0
