Find a block of descriptive text inside HTML using regex

Question

I'm trying to figure out how to use look-ahead to capture the descriptive text in an HTML page such as:




HTML Tags Stripper is designed to strip HTML tags from the text. It will also strip embedded JavaScript code, style information (style sheets), as well as code inside php/asp tags (<?php ?> <%php ?> <% %>). It will also replace sequence of new line characters (multiple) with only one. Allow tags feature is session sticky, i.e. it will remember allowed tags list, so you will have to type them only once.
You can either provide text in text area below, or enter URL of the web page. If URL provided then HTML Tags Stripper will visit web-page for its contents.
Known issues:

I figured a regex that looks for a '>' followed by at least 150 characters before a '<' would do the trick.

The closest I've gotten so far is:

(([^.<]){1,500})<

This still misses things like periods and other characters before and after the string.
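
For reference, the idea described above (a '>' followed by a long run of text before the next '<') could be sketched in Python roughly as follows. This is only a sketch under some assumptions: the 150-character minimum comes from the question, while the names `BLOCK_RE`, `find_text_blocks` and the sample markup are made up for illustration. Inside a character class, `.` is a literal period, so `[^.<]` in the attempt above excludes periods; `[^<]` does not. The attempt's `{1,500}` also accepts a single character, so it does not enforce a minimum length.

```python
import re
from typing import List

# Assumption: "descriptive text" means at least 150 characters containing no '<',
# sitting directly between a '>' and the next '<'.
# (?<=>) is a look-behind for '>', (?=<) is a look-ahead for '<';
# neither is included in the match itself.
BLOCK_RE = re.compile(r"(?<=>)[^<]{150,}(?=<)")


def find_text_blocks(html: str) -> List[str]:
    """Return every long tag-free run of text in html, with outer whitespace stripped."""
    return [m.group(0).strip() for m in BLOCK_RE.finditer(html)]


# Hypothetical sample markup: a short link label and one long paragraph.
sample = "<div><a href='#'>Home</a><p>" + "Descriptive sentence. " * 10 + "</p></div>"
blocks = find_text_blocks(sample)
print(blocks)
```

With this sample, the short link label is skipped and the long paragraph is kept, periods included. Note that `[^<]` stops at any literal '<', so text such as `<?php ?>` would only survive if the page encodes it as `&lt;?php ?&gt;`. For anything beyond a quick heuristic, a real HTML parser is probably the safer choice.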
