Reputation: 548

Remove links with hash symbol from string

I have a database with HTML content, and some of the text contains links. Some of the links have a hash symbol in their URLs; others do not.

I need to remove the links that contain a hash symbol while keeping those that do not.

Example:

Input:

<a href="http://example.com/books/1">The Lord of the Rings</a>
<ul>
    <li><a   href="http://example.com/books/1#c1" >Chapter 1</a></li>
    <li><a name="name before href" href="http://example.com/books/1#c2">Chapter 2</a></li>
    <li><a href="http://example.com/books/1#c3" name="name after href">Chapter 3</a></li>
    <li><a href="http://example.com/books/1#cN" target="_blank">Chapter N</a></li>
</ul>
<br><br>
<a href="http://example.com/books/1">Harry Potter</a>
<ul>
    <li><a href="http://example.com/books/2#c1" target="_self">Chapter 1</a></li>
    <li><a href="http://example.com/books/2#c2" name="some have name" title="some others have title" >Chapter 2</a></li>
    <li><a href="http://example.com/books/2#c3">Chapter 3</a></li>
    <li><a href="http://example.com/books/2#cN"  >Chapter N</a></li>
</ul>

Desired Output:

<a href="http://example.com/books/1">The Lord of the Rings</a>
<ul>
    <li>Chapter 1</li>
    <li>Chapter 2</li>
    <li>Chapter 3</li>
    <li>Chapter N</li>
</ul>
<br><br>
<a href="http://example.com/books/2">Harry Potter</a>
<ul>
    <li>Chapter 1</li>
    <li>Chapter 2</li>
    <li>Chapter 3</li>
    <li>Chapter N</li>
</ul>

I am trying this code, but it removes all the links, and I want to keep those without a hash symbol.

$content = preg_replace('#<a.*?>([^>]*)</a>#i', '$1', $content);

So, currently I am getting this:

The Lord of the Rings
<ul>
    <li>Chapter 1</li>
    <li>Chapter 2</li>
    <li>Chapter 3</li>
    <li>Chapter N</li>
</ul>
<br><br>
Harry Potter
<ul>
    <li>Chapter 1</li>
    <li>Chapter 2</li>
    <li>Chapter 3</li>
    <li>Chapter N</li>
</ul>

More details:

I am using PHP.
The only way I can tell which links to delete is the # symbol.
Some links contain a line break.

Example:

<a href="http://example.com">
    new line</a>
or
<a href="http://example.com">new
    line</a>

Upvotes: 1

Answers (4)

Lawrence Cherone

Reputation: 46602

You should avoid using regex here; use DOMDocument and DOMXPath instead.

<?php
$dom = new DOMDocument();

$dom->loadHtml('
<a href="http://example.com/books/1">The Lord of the Rings</a>
<ul>
    <li><a   href="http://example.com/books/1#c1" >Chapter 1</a></li>
    <li><a name="name before href" href="http://example.com/books/1#c2">Chapter 2</a></li>
    <li><a href="http://example.com/books/1#c3" name="name after href">Chapter 3</a></li>
    <li><a href="http://example.com/books/1#cN" target="_blank">Chapter N</a></li>
</ul>
<br><br>
<a href="http://example.com/books/1">Harry Potter</a>
<ul>
    <li><a href="http://example.com/books/2#c1" target="_self">Chapter 1</a></li>
    <li><a href="http://example.com/books/2#c2" name="some have name" title="some others have title" >Chapter 2</a></li>
    <li><a href="http://example.com/books/2#c3">Chapter 3</a></li>
    <li><a href="http://example.com/books/2#cN"  >Chapter N</a></li>
</ul>
', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);

foreach ($xpath->query("//a") as $link) {
    $href = $link->getAttribute('href');

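    // href contains a #: overwrite the parent's content with the link's text
    // (this also discards any other children of the parent; fine for these <li> items)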
    if (strpos($href, '#') !== false) {
        $link->parentNode->nodeValue = $link->nodeValue;
    }
}

echo $dom->saveHTML();

https://3v4l.org/8FQYb

Result:

<a href="http://example.com/books/1">The Lord of the Rings<ul>
    <li>Chapter 1</li>
    <li>Chapter 2</li>
    <li>Chapter 3</li>
    <li>Chapter N</li>
</ul><br><br><a href="http://example.com/books/1">Harry Potter</a><ul>
    <li>Chapter 1</li>
    <li>Chapter 2</li>
    <li>Chapter 3</li>
    <li>Chapter N</li>
</ul></a>
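
A variation on the approach above, as a minimal sketch: it unwraps only the matching `<a>` elements instead of overwriting the parent, so any sibling content next to the link is kept, and link text that spans several lines is handled by the parser. The function name `unwrapHashLinks` and the wrapper `<div>` are assumptions made for this illustration and are not part of the answer's code. Input is assumed to be plain ASCII or otherwise safe for `loadHTML` without extra encoding hints.

```php
<?php
// Hypothetical helper: remove <a> tags whose href contains '#',
// keeping their child nodes (text, <b>, etc.) in place.
function unwrapHashLinks(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on imperfect HTML
    // Wrap the fragment in one <div> so it has a single, known container
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Select only the links whose href attribute contains a '#'
    foreach ($xpath->query('//a[contains(@href, "#")]') as $link) {
        // Move each child in front of the link, then drop the now-empty <a>
        while ($link->firstChild !== null) {
            $link->parentNode->insertBefore($link->firstChild, $link);
        }
        $link->parentNode->removeChild($link);
    }

    // Serialize only the children of the wrapper <div>
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo unwrapHashLinks('<li><a href="http://example.com/books/1#c1" >Chapter 1</a></li>');
```

Loading without `LIBXML_HTML_NOIMPLIED` and serializing the wrapper's children is one way to avoid the nesting seen in the result above, where the first `<a>` ends up wrapping the rest of the markup.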

Upvotes: 5

Capattax

Reputation: 131

After parsing the HTML and selecting all the links, you could use a foreach loop with str_replace, on the condition that the string contains a pound/hash symbol.

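<?php
// Assumes $links is an array of link strings extracted from the HTML
// (for example with DOMDocument) before this loop runs
foreach ($links as $line) {
    // strpos() can return 0 for a match at the start, so compare against false
    if (strpos($line, '#') !== false) {
        // str_replace() returns the result; assign it back to keep the change
        $links = str_replace($line, '', $links);
    }
}
?>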

This would replace each line containing a pound/hash symbol with an empty string, and the database would treat it as such.

Upvotes: 0

Wray Zheng

Reputation: 997

Use the following pattern to match <a href=...> and </a> in the text, and replace the matched text with an empty string.

(?<=<li>)<a.+?>|</a>(?=</li>)

This removes the unwanted strings instead of replacing the whole text with the wanted part.
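
For reference, here is a rough sketch of how the pattern could be applied with `preg_replace`. The `~` delimiter and the shortened `$content` sample are choices made for this illustration. Note that the pattern keys on the surrounding `<li>` tags rather than on the `#` itself, so it fits the sample only because every hash link there sits directly inside an `<li>`.

```php
<?php
// Shortened sample taken from the question's input
$content = '<a href="http://example.com/books/1">The Lord of the Rings</a>
<ul>
    <li><a   href="http://example.com/books/1#c1" >Chapter 1</a></li>
    <li><a href="http://example.com/books/1#c3" name="name after href">Chapter 3</a></li>
</ul>';

// '~' is used as the delimiter because the pattern itself contains '/'
$stripped = preg_replace('~(?<=<li>)<a.+?>|</a>(?=</li>)~', '', $content);

echo $stripped;
```

Because `.` does not match a newline by default, an opening tag that is broken across lines would not be matched by this sketch.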

Upvotes: 0

Chromane

Reputation: 175

This regex matches the examples you've given. It detects URLs with a # somewhere in them. You can then write a replace statement that swaps each match for the text from capture group \1.

<a(?:\s+name=".*?")?\s+href=.*?#.*?>(.*?)<\/a>

Regex in action
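
A minimal sketch of the replace step in PHP, assuming `preg_replace` with `~` as the delimiter and `$1` as the replacement; the shortened `$content` sample is taken from the question. This relies on each link sitting on its own line, as in the sample: with several links on one line, the lazy `.*?` could still run from a plain link into a later hash link.

```php
<?php
// Shortened sample taken from the question's input
$content = '<a href="http://example.com/books/1">The Lord of the Rings</a>
<ul>
    <li><a   href="http://example.com/books/1#c1" >Chapter 1</a></li>
    <li><a name="name before href" href="http://example.com/books/1#c2">Chapter 2</a></li>
    <li><a href="http://example.com/books/1#c3" name="name after href">Chapter 3</a></li>
</ul>';

// Replace every matched link with the text captured in group 1
$pattern = '~<a(?:\s+name=".*?")?\s+href=.*?#.*?>(.*?)<\/a>~';
$result = preg_replace($pattern, '$1', $content);

echo $result;
```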

Upvotes: 2
