Michelle

Reputation: 2235

Using SED to remove specific anchor tags within html in database

I've got a table which contains hundreds of guides with screenshots. The screenshot images were wrapped in anchor tags because they used to be clickable, but now I need to remove those anchor tags. Every anchor tag to be removed has an href of #screenshot followed by a number, as in the example below. My plan is to dump the table using mysqldump and then use sed to find and replace the matching strings.

<p>Choose <a href="/components">components</a> to install and click next.</p>
<div class="screen">
<a href="#screenshot3"><img src="/images/screens/install/step3.jpg" alt="Step 3"></a>
</div>

Should be

<p>Choose <a href="/components">components</a> to install and click next.</p>
<div class="screen">
<img src="/images/screens/install/step3.jpg" alt="Step 3">
</div>

I can match the opening tag using <a\shref\=\"#screenshot\d+\"\> but I also need to match its closing tag, so that both can be removed without removing other anchor tags. Any help would be greatly appreciated!

Upvotes: 0

Views: 495

Answers (1)

alestanis

Reputation: 21863

You can try replacing

<a\shref\=\"#screenshot\d+\"\>(.*)<\/a>

with \1.

The parentheses capture everything matched between them, so you can restore it in the replacement using \1, \2, and so on.
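As a rough sketch, the substitution could look like this as a sed command. It rests on a few assumptions that are not part of the pattern above: sed is run with -E (extended regexes), [0-9] and [[:space:]] replace \d and \s because sed does not understand \d, and the capture is narrowed from (.*) to a single <img> tag so a greedy match cannot run on to a later </a> on the same line. mysqldump usually writes double quotes inside string values as \", so the pattern also tolerates an optional backslash before each quote. The function name strip_screenshot_anchors is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads HTML (or a dump) on stdin and unwraps
# <a href="#screenshotN">...</a> when it contains exactly one <img> tag.
strip_screenshot_anchors() {
  # \\?"  : a double quote, optionally preceded by a backslash (dump escaping)
  # (<img[^>]*>) : capture only the image tag, restored with \1
  sed -E 's/<a[[:space:]]+href=\\?"#screenshot[0-9]+\\?">(<img[^>]*>)<\/a>/\1/g'
}

# Example run on the line from the question
printf '%s\n' '<a href="#screenshot3"><img src="/images/screens/install/step3.jpg" alt="Step 3"></a>' \
  | strip_screenshot_anchors
```

On a real dump you would run something like strip_screenshot_anchors < dump.sql > cleaned.sql (file names are placeholders) and diff the two files before importing anything. Anchors such as <a href="/components"> are left alone because their href does not start with #screenshot.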

Keep in mind though that regexes are not the right weapon to use when trying to modify HTML. Read this (and the comments around it) for an explanation.

Upvotes: 1
