Tom

Reputation: 37

Using regex to remove HTML elements and leave the content

Let's say I have the following HTML:

<b>Item 1</b> Text <br>
<b>Item 2</b> Text <br>
<b>Item 3</b> Text <br>
<p><font color="#000000" face="Arial, Helvetica, sans-serif"><b>Item 4:</b></font></p>
<p><font color="#000000" face="Arial, Helvetica, sans-serif">Detailed Description</font></p>

and I am using the following regex to capture data: (Item 1:.*?<br>)/gi, which returns <b>Item 1</b> Text <br>

How do I drop or remove the <b>, </b> and <br> tags

to be left with

Item 1 Text

I've been trying to make sense of this pattern, <(\w+)[^>]*>.*<\/\1>, but so far no luck. All the examples I have seen on here seem to require an id or class, which my HTML does not have, so I'm a bit stuck getting those examples to fit my problem.

Upvotes: 1

Views: 1584

Answers (4)

J-Mik

Reputation: 896

In a regex, parentheses define capture groups, whose contents can later be referenced as \1, \2, \3, etc. (or sometimes $1, $2, $3). So simply use them to capture the text you want.

I think this regex would work for you:

<b>(Item \d+)</b>(.*?)<br>

In detail, the expression means:

  • (Item \d+): Any string formatted as "Item [at least 1 digit]"
  • (.*?): any sequence of characters; the ? makes the match lazy, so it consumes as few characters as possible before the following <br>.
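
To see what the ? changes, here is a small sketch comparing the lazy pattern with a greedy variant. The two-item input string and the variable names are made up for illustration:

```javascript
// Hypothetical input: two items on one line, so greedy matching can overshoot.
var s = "<b>Item 1</b> A <br><b>Item 2</b> B <br>";

// Lazy: (.*?) stops at the first <br> it can.
var lazy = /<b>(Item \d+)<\/b>(.*?)<br>/.exec(s);

// Greedy: (.*) runs to the last <br> in the string.
var greedy = /<b>(Item \d+)<\/b>(.*)<br>/.exec(s);

console.log(JSON.stringify(lazy[2]));   // " A "
console.log(JSON.stringify(greedy[2])); // " A <br><b>Item 2</b> B "
```

With the greedy version the second group swallows the whole second item, which is why the lazy form is the one to use here.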

So in <b>Item 5434</b>hel34lo 0345 345<br>, with the regex above, your captured groups are:

  • \1 = Item 5434
  • \2 = hel34lo 0345 345

I've never programmed in JavaScript, but more concretely, this piece of code might work:

var myString = "<b>Item 5434</b>hel34lo 0345 345<br>";
// the slash in </b> must be escaped inside a regex literal
var myRegexp = /<b>(Item \d+)<\/b>(.*?)<br>/g;
var match = myRegexp.exec(myString);
alert(match[1]); // Item 5434
alert(match[2]); // hel34lo 0345 345
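
Because the pattern carries the g flag, calling exec repeatedly walks through every match in the string. A rough sketch, assuming the three items from the question are joined with newlines; the function name collectItems is hypothetical, and trimming the second group is an extra step not in the code above:

```javascript
// Hypothetical helper: collects "Item N text" strings from the input.
function collectItems(html) {
  var re = /<b>(Item \d+)<\/b>(.*?)<br>/g;
  var items = [];
  var m;
  // With the g flag, each exec call resumes after the previous match
  // and returns null once there are no more matches.
  while ((m = re.exec(html)) !== null) {
    items.push(m[1] + " " + m[2].trim());
  }
  return items;
}

var html = "<b>Item 1</b> Text <br>\n<b>Item 2</b> Text <br>\n<b>Item 3</b> Text <br>";
console.log(collectItems(html)); // [ 'Item 1 Text', 'Item 2 Text', 'Item 3 Text' ]
```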

Upvotes: 0

Francis Gagnon

Reputation: 3675

This regex will match b and br tags:

</?br?\s*/?>

To use it in Javascript you write something like this:

result = subject.replace(/<\/?br?\s*\/?>/img, "");

All the matched tags will be replaced with an empty string.

In my experience it is better to replace br tags with a space and replace normal inline tags with an empty string, so that text on either side of a line break does not run together. If that is what you want to do, this next regex matches only b tags:

</?b\s*/?>

and this one matches only br tags:

</?br\s*/?>
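
Putting the two patterns together might look like the sketch below. The helper name stripBoldAndBreaks is made up for illustration, and the final whitespace cleanup is an added assumption about the desired output, not something the patterns themselves require:

```javascript
// Hypothetical helper combining the two patterns above.
function stripBoldAndBreaks(html) {
  return html
    .replace(/<\/?br\s*\/?>/gi, " ") // <br>, <br/>, <br /> become a space
    .replace(/<\/?b\s*\/?>/gi, "")   // <b> and </b> are dropped
    .replace(/\s+/g, " ")            // assumption: collapse runs of whitespace
    .trim();                         // assumption: drop leading/trailing space
}

console.log(stripBoldAndBreaks("<b>Item 1</b> Text <br>")); // Item 1 Text
console.log(stripBoldAndBreaks("one<br/>two"));             // one two
```

Note that neither pattern matches tags carrying attributes, so the font and p tags from the question would be left untouched.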

Upvotes: 1

rtcherry

Reputation: 4880

This should do the trick:

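// match() returns null when nothing matches, so fall back to an empty array
var matches = stringToTest.match(/(Item \d+.*?<br\/?>)/gi) || [];
for (var i = 0; i < matches.length; i++) {
  // strip every tag from the captured fragment
  matches[i] = matches[i].replace(/<[^>]+>/g, '');
}
alert(matches);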

If you have jQuery:

alert(
    $.map(stringToTest.match(/(Item \d+.*?<br\/?>)/gi), function(v) { return v.replace(/<[^>]+>/g, '') })
);

Upvotes: 1

Sanjeev

Reputation: 1866

Try this regex: <[^>]*>

Replacing every match with an empty string removes all the HTML tags, opening and closing, with or without attributes.
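
In JavaScript that could be sketched as follows. The function name stripTags is hypothetical, and the trailing trim is an added convenience:

```javascript
// Hypothetical helper: removes anything that looks like <...> from a string.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, "").trim();
}

var sample = '<p><font color="#000000" face="Arial, Helvetica, sans-serif"><b>Item 4:</b></font></p>';
console.log(stripTags(sample));                    // Item 4:
console.log(stripTags("<b>Item 1</b> Text <br>")); // Item 1 Text
```

This is a text-level cleanup rather than an HTML parser: it treats any < ... > span as a tag, so a literal > inside an attribute value would end the match early.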

Upvotes: 3
