JiminP

Reputation: 2142

How to remove text between "matching" parentheses?

When I read the alt (technically title) text of this XKCD comic, I became curious whether every article on Wikipedia eventually points to the Philosophy article. So I started writing a web application in PHP that displays which article each article "points" to.

(PS: don't worry about traffic; I'll use it privately and won't send too many requests to the Wikipedia servers.)

To do this, I have to remove the text between parentheses and in italics, and then get the first link. The rest can be done with PHP Simple HTML DOM Parser, but removing the text between parentheses is the problem.

If parentheses were never nested, I could use this regex: \([^\)]+\). However, some articles, such as the one about the German language, have nested parentheses (for example: German (Deutsch [ˈdɔʏtʃ] ( listen)) is..), and the regex above can't handle these cases, since [^\)]+\) stops at the first closing parenthesis, not the matching one. (Actually, this particular case isn't a problem since there's no text between the two closing parentheses, but it becomes a big problem when there's a link between them.)
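To make the failure concrete, here is a quick check of that regex against a nested string. This is only an illustration; the variable names are made up for the example.

```php
<?php
// Nested input: the pattern stops at the first ")" it meets.
$text = 'blah(asdf(foo)bar(lol)asdf)blah';
$naive = preg_replace('/\([^\)]+\)/', '', $text);
echo $naive, "\n"; // blahbarasdf)blah
```

The first match is "(asdf(foo)", so "bar" and the trailing "asdf)" survive.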

One dirty solution I can think of is this:

$s = "content of a wikipedia article";
$depth = 0;
$s2 = "";
for ($i = 0; $i < strlen($s); $i++) {
    $c = substr($s, $i, 1);
    if ($c == '(') $depth++;
    if ($c == ')') {
        if ($depth > 0) $depth--;
        continue;
    }
    if ($depth == 0) $s2 .= $c;
}
$s = $s2;

However, I don't like this solution, since it walks through the string one character at a time, which seems unnecessary...

Is there another way to remove the text inside a pair of (matching) parentheses?

For example, I want to make this text:

blah(asdf(foo)bar(lol)asdf)blah

into this:

blahblah

but not like this:

blahbarasdf)blah

Edit: from a comment on Emil Vikström's answer, I realized that the approach above (removing text between parentheses) may remove a link that itself contains parentheses. However, I'd still like an answer to the problem above, since I've run into a similar problem before and want to know the answer...

So my question is still: how do I remove the text between matching parentheses?

Upvotes: 3

Views: 633

Answers (2)

Raj

Reputation: 22956

Great! I ran into the same problem while cleaning up Wikipedia plain-text content, and wrote a function for it. Here is how you use it:

cleanBraces("blah(asdf(foo)bar(lol)asdf)blah", "(", ")")

will return

blahblah

You can pass any type of braces, such as [ and ] or { and }.

Here is my source code:

function cleanBraces($source, $oB, $eB) {
    $finalText = "";
    // Repeatedly cut out the first balanced top-level pair until none is left.
    while (($brace = getBracesPos($source, $oB, $eB)) !== null) {
        $finalText .= substr($source, 0, $brace[0]);
        $source = substr($source, $brace[1] + 1);
    }
    return $finalText . $source;
}

// Returns array(start, end) of the first balanced top-level pair, or null.
function getBracesPos($source, $oB, $eB) {
    $open = 0;
    $firstOpenBrace = 0;
    $length = strlen($source);
    for ($i = 0; $i < $length; $i++) {
        $currentChar = substr($source, $i, 1);
        if ($currentChar == $oB) {
            $open++;
            if ($open == 1) { // first open brace
                $firstOpenBrace = $i;
            }
        } else if ($currentChar == $eB && $open > 0) { // ignore stray closing braces
            $open--;
            if ($open == 0) { // the outermost pair is closed
                return array($firstOpenBrace, $i);
            }
        }
    } //for
    return null; // no balanced pair found; the caller leaves the rest untouched
}
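For what it's worth, the same depth counter can also skip over whole runs of text instead of looking at one character at a time. A minimal sketch, assuming plain round parentheses and using PHP's strcspn to jump to the next parenthesis; the function name stripParenthesized is made up for this example:

```php
<?php
// Hypothetical helper, not part of the answer above.
function stripParenthesized(string $s): string
{
    $out = '';
    $depth = 0;
    $len = strlen($s);
    $i = 0;
    while ($i < $len) {
        // Length of the run up to the next "(" or ")" (or the end of the string).
        $run = strcspn($s, '()', $i);
        if ($depth === 0) {
            $out .= substr($s, $i, $run); // keep text outside any parentheses
        }
        $i += $run;
        if ($i >= $len) {
            break;
        }
        if ($s[$i] === '(') {
            $depth++;
        } elseif ($depth > 0) {
            $depth--;
        } else {
            $out .= ')'; // unmatched ")" is kept as ordinary text
        }
        $i++;
    }
    return $out;
}

echo stripParenthesized('blah(asdf(foo)bar(lol)asdf)blah'), "\n"; // blahblah
```

An opening parenthesis that is never closed swallows the rest of the string here, so this is only suitable when the input is expected to be balanced.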

Upvotes: 1

Emil Vikström

Reputation: 91983

You can check out recursive patterns, which should be able to solve the problem.
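As a rough sketch of what such a pattern could look like for round parentheses: PCRE's (?R) recurses into the whole pattern, so a group may contain further balanced groups. The function name removeNestedParens is made up for this example, and I haven't tuned it for very long or deeply nested input.

```php
<?php
// Hypothetical helper; relies on PCRE recursion via (?R).
function removeNestedParens(string $text): string
{
    // \(             an opening parenthesis
    // (?:[^()]++     a run of non-parenthesis characters (possessive)
    //    |(?R))*     or a nested balanced group (recurse into the whole pattern)
    // \)             the matching closing parenthesis
    $result = preg_replace('/\((?:[^()]++|(?R))*\)/', '', $text);
    // preg_replace returns null on a PCRE error; fall back to the input then.
    return $result ?? $text;
}

echo removeNestedParens('blah(asdf(foo)bar(lol)asdf)blah'), "\n"; // blahblah
```

Unbalanced parentheses simply don't match, so they are left in the output as they are.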

When I read the comic I didn't have the willpower to get my head around recursive patterns, so I simplified the problem: find each link, and only then check whether it's inside parentheses. Here's my solution:

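  // Fetch all article links (the [^:"#] class skips namespaced links such as File:)
  $matches = array();
  preg_match_all('!<a [^>]*href="/wiki/([^:"#]+)["#].*>!Umsi', $text, $matches);
  $links = $matches[1];
  // Find the first link that is not within parentheses
  $found = false;
  foreach($links as $l) {
    // Quote the link for use in the pattern, including the "!" delimiter
    if(preg_match('!\([^)]+/wiki/'.preg_quote($l, '!').'.+\)!Umsi', $text)) {
      continue;
    }else{
      $found = true;
      break;
    }
  }
  // If $found is true, $l now holds the first link outside parentheses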

Here's my entire script: http://lajm.eu/emil/dump/filosofi.phps

Upvotes: 3
