urig

Reputation: 16831

How to match text and skip HTML tags using a regular expression?

I have a bunch of records in a QuickBase table that contain a rich text field. In other words, they each contain some paragraphs of text intermingled with HTML tags like <p>, <strong>, etc.

I need to migrate the records to a new table where the corresponding field is a plain text field. For this, I would like to strip out all HTML tags and leave only the text in the field values.

For example, from the input below, I would expect to extract "just a small example link to a webpage":

  <p>just a small <a href="#">
  example</a> link</p><p>to a webpage</p> 

As I am trying to get this done quickly and without coding or using an external tool, I am constrained to using Quickbase Pipelines' Text channel tool. The way it works is that I define a regex pattern and it outputs only the bits that match the pattern.

So far I've been able to come up with this regular expression (Python-flavored as QB's backend is written in Python) that correctly does the exact opposite of what I need. I.e. it matches only the HTML tags:

/(<[^>]*>)/
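
For reference, a minimal sketch of what this pattern matches when run through Python's re module on the sample input. It assumes the surrounding slashes are only delimiters and not part of the pattern; the name find_tags is made up for this illustration.

```python
import re

# The tag-matching pattern from the question, without the / delimiters.
TAG_PATTERN = re.compile(r"(<[^>]*>)")

def find_tags(html):
    """Return every tag-like substring (hypothetical helper for illustration)."""
    # With a single capturing group, findall returns the group contents.
    return TAG_PATTERN.findall(html)

sample = '<p>just a small <a href="#">\nexample</a> link</p><p>to a webpage</p>'
print(find_tags(sample))
```

Only the tags come back; none of the text between them is matched.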

In a sense, I need the negative image of this expression but have not been able to build it myself.

Your help in "negating" the above expression is most appreciated.

Upvotes: 0

Views: 204

Answers (1)

bobble bubble

Reputation: 18515

Assuming no < or > appears outside the tags (or any that do are entity-encoded), here is an idea using a lookbehind:

(?:(?<=>)|^)[^<]+

See this demo at regex101

(?:(?<=>)|^) is an alternation between either ^ start of the string or looking behind for any >. From there [^<]+ matches one or more characters that are not < (negated character class).
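
As a rough illustration of how this pattern behaves under Python's re module (the flavor mentioned in the question), here is a minimal sketch. The helper names extract_text_chunks and to_plain_text are made up for this example, and the whitespace clean-up step is an assumption, not something the Pipelines Text channel is known to do.

```python
import re

# Pattern from the answer: text that starts at the beginning of the string
# or right after a '>', and runs up to (not including) the next '<'.
TEXT_BETWEEN_TAGS = re.compile(r"(?:(?<=>)|^)[^<]+")

def extract_text_chunks(html):
    """Return every run of text found between tags (hypothetical helper)."""
    return TEXT_BETWEEN_TAGS.findall(html)

def to_plain_text(html):
    """Join the chunks with spaces and collapse whitespace (assumed clean-up step)."""
    return " ".join(" ".join(extract_text_chunks(html)).split())

sample = '<p>just a small <a href="#">\nexample</a> link</p><p>to a webpage</p>'
print(extract_text_chunks(sample))
print(to_plain_text(sample))
```

Note that the pattern yields separate chunks, and [^<] also matches newlines, so the line break inside the link text is kept in its chunk. Joining the chunks with a space keeps "link" and "to" apart across the paragraph boundary, but it would also split a word that has a tag in the middle of it, such as <b>bo</b>ld.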

Upvotes: 3
