How to match text and skip HTML tags using a regular expression?

Question

I have a bunch of records in a QuickBase table that contain a rich text field. In other words, they each contain some paragraphs of text intermingled with HTML tags.

I need to migrate the records to a new table where the corresponding field is a plain text field. For this, I would like to strip out all HTML tags and leave only the text in the field values.

For example, given an input in which the following text is wrapped in HTML tags, I would expect to extract just the text itself:

just a small example link to a webpage

As I am trying to get this done quickly and without coding or using an external tool, I am constrained to using Quickbase Pipelines' Text channel tool. The way it works is that I define a regex pattern and it outputs only the bits that match the pattern.

So far I've been able to come up with this regular expression (Python-flavored as QB's backend is written in Python) that correctly does the exact opposite of what I need. I.e. it matches only the HTML tags:

/(<[^>]*>)/
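To make the behaviour of this pattern concrete, here is a minimal Python sketch. The `sample` string is a hypothetical input made up for illustration (the real field values are not shown here), and Python's standard `re` module stands in for whatever the Pipelines tool does internally.

```python
import re

# Hypothetical rich-text value, invented for illustration only.
sample = '<p>just a small <a href="https://example.com">example link</a> to a webpage</p>'

# The pattern from the question: "<", then any run of non-">" characters, then ">".
tag_pattern = re.compile(r"(<[^>]*>)")

# findall returns the contents of the single capturing group for each match.
tags = tag_pattern.findall(sample)
print(tags)
# ['<p>', '<a href="https://example.com">', '</a>', '</p>']
```

Only the tags come back, which is the opposite of the desired output.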

In a sense, I need the negative image of this expression but have not been able to build it myself.

Your help in "negating" the above expression is most appreciated.

bobble bubble · Accepted Answer

Assuming no < or > occurs anywhere in the text outside of tags (or that any such characters are entity-encoded), here is an idea using a lookbehind.

(?:(?<=>)|^)[^<]+

See this demo at regex101

(?:(?<=>)|^) is an alternation between either ^ start of the string or looking behind for any >. From there [^<]+ matches one or more characters that are not < (negated character class).
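As a quick check of the pattern outside regex101, here is a minimal Python sketch. The `sample` string is a hypothetical input invented for illustration, and joining the matches with `"".join` is an assumption about how the pieces would be reassembled; the Pipelines tool may combine its matches differently.

```python
import re

# Hypothetical rich-text value, invented for illustration only.
sample = '<p>just a small <a href="https://example.com">example link</a> to a webpage</p>'

# Start either at the beginning of the string or right after a ">",
# then take one or more characters that are not "<".
text_pattern = re.compile(r"(?:(?<=>)|^)[^<]+")

# The group is non-capturing, so findall returns the full matches.
pieces = text_pattern.findall(sample)
print(pieces)
# ['just a small ', 'example link', ' to a webpage']

plain_text = "".join(pieces)
print(plain_text)
# just a small example link to a webpage
```

Each run of text between tags is a separate match, so the output is a list of fragments rather than one string. The same assumption as above applies: a stray < or > inside the text would break the matching.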
