Reputation: 313
I am using preg_replace to strip out <p> tags and <li> tags and making them carriage returns. I have some <a> tags in my string, and I want to strip those out, but keep the href attribute. For instance, if I have:
<a href = "http://www.example.com">Click Here</a>
what I want is:
http://www.example.com Click Here
Here is what I have so far:
$text .= preg_replace(array("/<p[^>]*>/iU","/<\/p[^>]*>/iU","/<ul[^>]*>/iU","/<\/ul[^>]*>/iU","/<li[^>]*>/iU","/<\/li[^>]*>/iU"), array("","\r\n\r\n","","\r\n\r\n","","\r\n"), $content);
Thanks
Upvotes: 0
Views: 305
Reputation: 18455
If I were you, I would use SimpleHTMLDom, which lets you find and edit elements with selectors instead of regexes. Here's a usage example from the docs:
// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');
$html->find('div', 1)->class = 'bar';
$html->find('div[id=hello]', 0)->innertext = 'foo';
echo $html;
// Output: <div id="hello">foo</div><div id="world" class="bar">World</div>
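For the anchor rewrite in the question, a DOM parser can do the replacement directly. The sketch below is not from the SimpleHTMLDom docs: it swaps in PHP's built-in DOMDocument, and anchors_to_text() is a hypothetical helper name chosen for illustration. It assumes the input is an ASCII HTML fragment with no unbalanced </div>, since the fragment is wrapped in a temporary <div> so it can be read back out.

```php
<?php
// Hypothetical helper: replace every <a> element with "href text".
// Uses the built-in DOM extension rather than SimpleHTMLDom.
function anchors_to_text(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parse warnings quietly
    $doc->loadHTML('<div>' . $html . '</div>');   // temporary wrapper element
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list first, because replacing nodes mutates it.
    $anchors = iterator_to_array($doc->getElementsByTagName('a'));
    foreach ($anchors as $a) {
        $label = trim($a->getAttribute('href') . ' ' . $a->textContent);
        $a->parentNode->replaceChild($doc->createTextNode($label), $a);
    }

    // Serialize only the children of the wrapper <div>.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo anchors_to_text('<a href = "http://www.example.com">Click Here</a>');
```

The remaining <p> and <li> handling could then be done on the returned string, or in the same loop style by replacing those elements with text nodes and line breaks.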
Upvotes: 3
Reputation: 34395
If a regex solution is desired, here is a tested function which handles the anchor tags as you requested (with the caveats noted below). The anchor regex is presented in verbose mode with comments:
function process_markup($content) {
    return preg_replace(
        array( // Regex patterns
            '%<(?:p|ul|li)\b[^>]*>%i',       // Open tags (\b keeps <pre>, <link> etc. from matching).
            '%<\/(?:p|ul|li)\b[^>]*>\s*%i',  // Close tags.
            '%                # Match A element (with no "<>" in attributes!)
            <a\b              # Start tag name.
            [^>]+?            # anything up to HREF attribute.
            href\s*=\s*       # HREF attribute name and "="
            (["\']?)          # $1: Optional quote delimiter
            ([^>\s]+)         # $2: HREF attribute value.
            (?(1)\1)          # If open quote, match close quote.
            [^>]*>            # Remainder of start tag
            (.*?)             # $3: A element contents.
            </a\s*>           # A element end tag.
            %ix'
        ),
        array( // Replacement strings
            "",      # Simply strip P, UL, and LI open tags.
            "\r\n",  # Replace close tags with line endings.
            "$2 $3"  # Keep A element HREF value and contents.
        ), $content);
}
I took the liberty of modifying the other regexes as well. Adjust as necessary.
CAVEATS: This regex solution assumes:
All A, P, UL and LI elements have no angle brackets <> in their attributes.
There are no A, P, UL or LI element start or end tags within any CDATA sections such as SCRIPT or STYLE elements, or HTML comments, or inside other start tag attributes.
Otherwise, this should work pretty well for a lot of HTML markup.
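To make the first caveat concrete, here is a small check, offered as a sketch rather than a full test suite. It uses a single-line copy of the anchor pattern from process_markup() above; the variable names and the second sample string are made up for illustration. The first sample is the anchor from the question, the second puts a ">" inside a title attribute.

```php
<?php
// Single-line copy of the anchor pattern from process_markup().
$anchor = '%<a\b[^>]+?href\s*=\s*(["\']?)([^>\s]+)(?(1)\1)[^>]*>(.*?)</a\s*>%i';

// Illustrative inputs: one well-behaved anchor, one that violates the
// "no angle brackets in attributes" assumption.
$samples = array(
    'plain'   => '<a href = "http://www.example.com">Click Here</a>',
    'bracket' => '<a title="a > b" href="http://www.example.com">Click Here</a>',
);

foreach ($samples as $name => $html) {
    $result = preg_replace($anchor, '$2 $3', $html);
    // Report "unchanged" when the pattern found nothing to replace.
    echo $name, ': ', ($result === $html ? 'unchanged' : $result), "\n";
}
// plain: http://www.example.com Click Here
// bracket: unchanged
```

The second anchor is left as-is because [^>]+? cannot step over the ">" inside the title value, so the pattern never reaches href. Markup like that needs a real parser.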
I realize that many wince when they hear the words HTML and REGEX spoken in the same breath, but in this particular case, I think a regex solution will work quite well (within the above limitations). The A tag is one of those which is not nested, so a regex can easily match the start tag, contents and end tag all in one whack. Same thing with the individual start and end tags for the other elements (which can be nested) when considered independently.
Upvotes: 0