Reputation: 23
I have a Google Sheets document with data from a web scrape. I'd like to remove all HTML tags except links and images. I have managed to remove all HTML with the REGEXREPLACE function:
=REGEXREPLACE(A1;"</?\S+[^<>]*>";"")
...but I want to keep the img and a tags.
Please help!
Upvotes: 2
Views: 604
Reputation: 626926
You can use
=REGEXREPLACE(A1;"(?s)(<img[^>]*>|<a(?:\s[^>]*)?>.*?</a>)|</?\w[^>]*>";"$1")
See the regex demo.
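To try the pattern outside Sheets, here is a minimal Python sketch of the same replacement. It assumes Python's re module behaves like the RE2 engine behind REGEXREPLACE for this particular pattern; the function name and sample string are made up for illustration.

```python
import re

# Same pattern as in the Sheets formula; group 1 captures the tags to keep.
PATTERN = re.compile(r"(?s)(<img[^>]*>|<a(?:\s[^>]*)?>.*?</a>)|</?\w[^>]*>")


def strip_tags_except_links_and_images(html: str) -> str:
    """Hypothetical helper: drop every tag except img and a elements."""
    # When an img or a element matched, group 1 holds it and is put back.
    # For any other tag group 1 is unmatched, so the match is replaced
    # with an empty string (the equivalent of "$1" in the formula).
    return PATTERN.sub(lambda m: m.group(1) or "", html)


# Illustrative input, not taken from the original question.
sample = '<p>Hi <b>there</b> <a href="/x">link</a> <img src="a.png"></p>'
print(strip_tags_except_links_and_images(sample))
```

Note that tags nested inside an a element are kept as well, because the whole element is captured as one match.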
Details
(?s) - a DOTALL modifier for the . to match any chars
(<img[^>]*>|<a(?:\s[^>]*)?>.*?</a>) - Capturing group 1 ($1 refers to this value from the replacement pattern):
  <img[^>]*> - any img tag
  | - or
  <a(?:\s[^>]*)?>.*?</a> - any a tag with its open and close element and inner text
| - or
</?\w[^>]*> - <, an optional /, a word char, and then any zero or more chars other than > and then a > char.
Upvotes: 1