Reputation: 2943
Not being terribly familiar with regular expressions, I need to find a way to parse lists of items from Wikipedia. I've pulled the content using Wikipedia's api.php and I am left with data that looks like this:
==Formal fallacies==
A [[formal fallacy]] is an error in logic that...
* [[Appeal to probability]] – takes something for granted because...
* [[Argument from fallacy]] – assumes that if an argument ...
* [[Base rate fallacy]] – making a probability judgement...
* [[Conjunction fallacy]] – assumption that an outcome simultaneously...
* [[Masked man fallacy]] – ...
===Propositional fallacies===
* [[Affirming a disjunct]] – concluded that ...
* [[Affirming the consequent]] – the [[antecedent...
* [[Denying the antecedent]] – the [[consequent]] in...
So, I need a way to pull the data so that:
Upvotes: 0
Views: 135
Reputation: 89565
This does the job:
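preg_match_all('~^\h*+\*\h*\[\[(?<name>[a-z ]++)]]\h*+[-–]\h*+(?<description>.++)$~imu', $text, $results, PREG_SET_ORDER);
// Keep only the named groups ("name", "description") and drop their numeric duplicates.
foreach ($results as &$result) {
    foreach ($result as $key => $value) {
        if (is_numeric($key)) unset($result[$key]);
    }
}
unset($result); // break the reference left over from the by-reference foreach
echo '<pre>' . print_r($results, true) . '</pre>';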
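For anyone who wants to try the pattern without calling api.php first, here is a minimal self-contained sketch. The sample $text is shortened from the question, and the $pattern, $matches and $items names are only illustrative. It assumes the file is saved as UTF-8, since the pattern contains an en dash and uses the u modifier.

```php
<?php
// Sample wikitext shortened from the question; in practice $text comes from api.php.
$text = <<<'WIKI'
==Formal fallacies==
A [[formal fallacy]] is an error in logic that...
* [[Appeal to probability]] – takes something for granted because...
* [[Masked man fallacy]] – ...
===Propositional fallacies===
* [[Affirming a disjunct]] – concluded that ...
WIKI;

// Same pattern as above: a bullet, a [[link]], a hyphen or en dash, then the rest of the line.
$pattern = '~^\h*+\*\h*\[\[(?<name>[a-z ]++)]]\h*+[-–]\h*+(?<description>.++)$~imu';
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

// Build a plain list of name/description pairs from the named groups.
// trim() drops a stray carriage return if the input uses Windows line endings.
$items = array_map(
    function (array $m) {
        return [
            'name'        => $m['name'],
            'description' => trim($m['description']),
        ];
    },
    $matches
);

print_r($items);
```

With this sample, only the three bullet lines should end up in $items; the headings and the intro sentence are skipped because they don't start with an asterisk. Note that [a-z ] only accepts letters and spaces, so titles containing digits, punctuation or a piped label would need a wider character class.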
Upvotes: 1
Reputation: 949
First, in multiline mode, replace
^((?!\*\s\[\[).)*$
with an empty string. This blanks out every line that doesn't contain * [[
Next, delete the empty lines that are left behind: replace
^\n|\r$
with an empty string.
Here is a regex to get the title and description:
^\s*\*\s\[\[([^\]]*)\]\]\s–\s*(.*)
Title: "$1", Description: "$2"
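The three replacements can be chained in PHP with preg_replace. This is only a sketch under a few assumptions: the sample $text is shortened from the question, the $step1, $step2 and $output names are illustrative, and the last pattern uses \s* before the asterisk and after the dash so that unindented bullets match and the description has no leading space.

```php
<?php
// Sample wikitext shortened from the question (assumed to be UTF-8).
$text = "==Formal fallacies==\n"
      . "A [[formal fallacy]] is an error in logic that...\n"
      . "* [[Appeal to probability]] – takes something for granted because...\n"
      . "===Propositional fallacies===\n"
      . "* [[Affirming a disjunct]] – concluded that ...\n";

// Step 1: blank every line that does not contain "* [[".
// The m modifier makes ^ and $ match at each line boundary.
$step1 = preg_replace('/^((?!\*\s\[\[).)*$/mu', '', $text);

// Step 2: remove the empty lines (and stray carriage returns) left behind.
$step2 = preg_replace('/^\n|\r$/mu', '', $step1);

// Step 3: rewrite each remaining bullet as a Title/Description pair.
$output = preg_replace(
    '/^\s*\*\s\[\[([^\]]*)\]\]\s–\s*(.*)/mu',
    'Title: "$1", Description: "$2"',
    $step2
);

echo $output;
```

If the goal is an array rather than rewritten text, the step 3 pattern could be passed to preg_match_all instead, which also makes steps 1 and 2 unnecessary.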
Upvotes: 0