pixelbruket

Reputation: 23

Google Sheets - Remove all HTML tags except <img> and <a> tags

I have a Google Sheets document with data from a web scrape. I'd like to remove all HTML tags except links and images. I have managed to remove all HTML with the REGEXREPLACE function:

=REGEXREPLACE(A1;"</?\S+[^<>]*>";"")

...but I want to keep the <img> and <a> tags.

Please help!

Upvotes: 2

Views: 604

Answers (1)

Wiktor Stribiżew

Reputation: 626926

You can use

=REGEXREPLACE(A1;"(?s)(<img[^>]*>|<a(?:\s[^>]*)?>.*?</a>)|</?\w[^>]*>";"$1")

See the regex demo.

Details

  • (?s) - a DOTALL modifier that lets . match any char, including line breaks
  • (<img[^>]*>|<a(?:\s[^>]*)?>.*?</a>) - Capturing group 1 ($1 refers to this value from the replacement pattern):
    • <img[^>]*> - any img tag
    • | - or
    • <a(?:\s[^>]*)?>.*?</a> - any a element: its opening tag, its inner content and its closing tag
  • | - or
  • </?\w[^>]*> - <, an optional /, a word char, and then any zero or more chars other than > and then a > char.
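To try the pattern outside Sheets, here is a minimal sketch in Python. It assumes Python's re module treats this particular pattern the same way as the RE2 engine behind REGEXREPLACE. The function name and the sample string are made up for illustration.

```python
import re

# Same pattern as in the formula: group 1 captures what should be kept
# (img tags and whole a elements), the last alternative matches any other tag.
PATTERN = re.compile(r"(?s)(<img[^>]*>|<a(?:\s[^>]*)?>.*?</a>)|</?\w[^>]*>")


def strip_tags_except_img_and_a(html: str) -> str:
    """Remove all tags except <img> tags and <a>...</a> elements."""
    # Put group 1 back when it matched; otherwise replace the tag with nothing.
    return PATTERN.sub(lambda m: m.group(1) or "", html)


# Hypothetical sample input, not taken from the question.
sample = '<p>Hello <b>world</b> <a href="https://example.com">link</a> <img src="x.png"></p>'
cleaned = strip_tags_except_img_and_a(sample)
print(cleaned)
# Hello world <a href="https://example.com">link</a> <img src="x.png">
```

Note that the whole a element is captured as one unit, so any tags nested inside a link (for example a <b> inside <a>...</a>) are kept as well. Tags that merely start with "a", such as <abbr>, are still removed because <a must be followed by whitespace or >.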

Upvotes: 1
