Reputation: 85378

PHP RegEx (or Alt Method) for Anchor tags

OK, I have to parse a SOAP request, and in the request some of the values are passed with (or inside) an anchor tag. I'm looking for a RegEx (or an alternative method) to strip the tag and just return the value.

// But item needs to be a RegEx of some sort, it's a field right now
if($sObject->list == 'item') {
   // Split on > this should be the end of the right side of the anchor tag
   $pieces = explode(">", $sObject->fields->$field);

   // Split on < this should be the closing anchor tag
   $piece = explode("<", $pieces[1]);

   $fields_string .= $piece[0] . "\n";
}

item is a field name, but I would like to use a RegEx that checks for the anchor tag instead of checking for a specific field.

Upvotes: 0

Answers (5)

NawaMan

Reputation: 932

If you want to strip or extract properties from only specific tags, you should try DOMDocument.

Something like this:


$TagWhiteList = array(
    // Example of WhiteList
    'b', 'i', 'u', 'strong', 'em', 'a', 'img'
);

function getTextFromNode($Node, $Text = "") {
    // Text nodes have no tag name: append their content
    if ($Node instanceof DOMText)
        return $Text.$Node->textContent;

    // You may select a tag here
    // Like:
    // if (in_array($Node->nodeName, $TagWhiteList))
    //     DoSomethingWithIt($Text, $Node);

    // Recurse into each child in document order; the loop simply
    // does not run when the node has no children
    for ($Child = $Node->firstChild; $Child != null; $Child = $Child->nextSibling)
        $Text = getTextFromNode($Child, $Text);

    return $Text;
}

function getTextFromDocument($DOMDoc) {
    return getTextFromNode($DOMDoc->documentElement);
}

To use:

$Doc = new DOMDocument();
$Doc->loadHTMLFile("Test.html");
$Text = getTextFromDocument($Doc);
echo "Text from HTML: ".$Text."\n";

The function above shows how to strip tags, but you can modify it a bit to manipulate the elements instead. For example, if the tag is 'a' (an anchor), you can extract its target and display that instead of the text inside.
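As a rough sketch of that modification (not the answerer's own code): the hypothetical helper getTextWithLinkTargets() below walks the tree the same way, but emits the href attribute whenever it meets an 'a' element. The sample markup and URL are made up for illustration.

```php
<?php
// Hypothetical helper: like the tag stripper above, but an <a> element
// is replaced by its href target instead of its inner text.
function getTextWithLinkTargets(DOMNode $node): string {
    // Text nodes contribute their text as-is
    if ($node instanceof DOMText) {
        return $node->nodeValue;
    }
    // Anchors contribute their target; their children are skipped
    if ($node instanceof DOMElement && strtolower($node->tagName) === 'a') {
        return $node->getAttribute('href');
    }
    // Any other node: concatenate the results of its children
    $text = '';
    foreach ($node->childNodes as $child) {
        $text .= getTextWithLinkTargets($child);
    }
    return $text;
}

$doc = new DOMDocument();
$doc->loadHTML('<p>See <a href="http://example.com/">this page</a> now.</p>');
echo getTextWithLinkTargets($doc->documentElement), "\n";
```

Swapping getAttribute('href') for $node->textContent in the anchor branch would give back the link text instead, which is closer to what the question asks for.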

Hope this helps.

Upvotes: 0

zanbaldwin

Reputation: 1009

I agree with cletus: using RegEx on HTML is bad practice because of how loose HTML is as a language (and I moan about PHP being too loose...). There are so many ways to vary a tag that, unless you know the document is standards-compliant / strict, it is sometimes simply impossible to do. However, because I like a challenge that distracts me from work, here's how you might do it in RegEx!

I'll split this up into sections; there's no point if all you see is a string and say, "Meh... It'll do..."! First we have the main RegEx for an anchor tag:

'#<a></a>#'

Then we add in the text that could be between the tags. We want to group this in parentheses so we can extract the string, and the question mark makes the asterisk wildcard "un-greedy", meaning that the first </a> it comes across will be the one it uses to end the match.

'#<a>(.*?)</a>#'

Next we add in the RegEx for href="". We match the href=" as plain text, then an any-length string that does not contain a quotation mark, then the ending quotation mark.

'#<a href\="([^"]*)">(.*?)</a>#'

Now we just need to say that the tag is allowed other attributes. According to the specification, an attribute name can be matched with: [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*. Allowing such an attribute, together with a quoted value, to appear any number of times, we get: ( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*.

The resulting RegEx (PCRE) is as follows:

'#<a( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")* href\="([^"]*)"( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*>(.*?)</a>#'

Now, in PHP, use the preg_match_all() function to grab all occurrences in the string.

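$regex = '#<a( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")* href\="([^"]*)"( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*>(.*?)</a>#';
// PREG_SET_ORDER groups the results per match, so each $link holds one anchor
// (the default order would group them per capture group instead)
preg_match_all($regex, $str_containing_anchors, $result, PREG_SET_ORDER);
foreach($result as $link)
 {
  $href = $link[2]; // value of the href attribute
  $text = $link[4]; // text between <a ...> and </a>
 }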
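Applied to the question's case, where a field value may or may not be wrapped in an anchor, a minimal sketch could look like the following. It deliberately uses a much simpler pattern than the one built above (any attributes, not just well-formed ones), and the helper name anchorTextOrValue() and the sample values are assumptions made for illustration. Like any RegEx approach, it breaks on input such as an attribute value containing ">".

```php
<?php
// Hypothetical helper: return the text of the first anchor in $value,
// or $value unchanged when it contains no anchor tag.
function anchorTextOrValue(string $value): string {
    // <a, optional attributes, >, lazily captured text, </a>
    // i = case-insensitive, s = let "." match newlines too
    if (preg_match('#<a\b[^>]*>(.*?)</a>#is', $value, $m)) {
        return $m[1];
    }
    return $value;
}

echo anchorTextOrValue('<a href="http://example.com/?id=1">Acme Corp</a>'), "\n";
echo anchorTextOrValue('plain value'), "\n";
```

This replaces the check against a fixed field name: every field can be passed through the helper, and only values that actually contain an anchor are changed.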

Upvotes: 1

VolkerK

Reputation: 96189

If you don't have some kind of request<->class mapping you can extract the information with the DOM extension. The property textContent contains all the text of the context node and its descendants (for an element node, nodeValue, used below, returns the same text).

$sr = '<?xml version="1.0"?>
<SOAP:Envelope xmlns:SOAP="urn:schemas-xmlsoap-org:soap.v1">
  <SOAP:Body>
    <foo:bar xmlns:foo="urn:yaddayadda">
       <fragment>
         <a href="....">Mary</a> had a
         little <a href="....">lamb</a>
       </fragment>
    </foo:bar>
  </SOAP:Body>
</SOAP:Envelope>';

$doc = new DOMDocument;
$doc->loadXML($sr);

$xpath = new DOMXPath($doc);
$ns = $xpath->query('//fragment');
if ( 0 < $ns->length ) {
  echo $ns->item(0)->nodeValue;
}

prints

Mary had a
little lamb

Upvotes: 0

w35l3y

Reputation: 8783

Use SimpleXML and XPath to retrieve the desired nodes.
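A minimal sketch of what that might look like, assuming the anchors sit in un-namespaced elements; the sample XML and its element names are invented for illustration and are not taken from the question's SOAP request:

```php
<?php
// Made-up sample data: two fields whose values are wrapped in anchors
$xml = '<fields>'
     . '<name><a href="http://example.com/1">Mary</a></name>'
     . '<pet><a href="http://example.com/2">lamb</a></pet>'
     . '</fields>';

$root = simplexml_load_string($xml);

// Collect the text of every <a> element, wherever it appears
$texts = [];
foreach ($root->xpath('//a') as $a) {
    $texts[] = (string) $a; // the string cast yields the element's text
}

print_r($texts);
```

For a real SOAP envelope with namespaced elements, the prefixes would likely have to be registered with registerXPathNamespace() before querying them.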

Upvotes: 0

cletus

Reputation: 625397

PHP has a strip_tags() function.

Alternatively you can use filter_var() with FILTER_SANITIZE_STRING (note that this filter is deprecated as of PHP 8.1).

Whatever you do, don't parse HTML/XML with regular expressions. It's really error-prone and flaky. PHP has at least 3 different parsers as standard (SimpleXML, DOMDocument and XMLReader spring to mind).
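For the simple case in the question, strip_tags() may be all that is needed. A short sketch, with a made-up field value:

```php
<?php
// Made-up value as it might arrive in the request
$value = '<a href="http://example.com/?id=1">Acme Corp</a>';

// strip_tags() removes the markup and keeps the text between the tags
$clean = strip_tags($value);
echo $clean, "\n";

// A value without any tags passes through unchanged
echo strip_tags('plain value'), "\n";
```

Because untagged values are left alone, this can be applied to every field without first checking which ones contain an anchor.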

Upvotes: 3
