Reputation: 786

How do I extract data between HTML tags using Regex?

I've been assigned some sed homework in my class and am one step away from finishing the assignment. I've racked my brain trying to come up with a solution, and nothing has worked, to the point where I'm about to give up.

Basically, in the file I've got...I'm supposed to replace this:

<b>Some text here...each bold tag has different content...</b>

with

Some text here...each bold tag has different content...

I've got it partially completed, but what I can't figure out is how to "echo" the extracted content using sed (regexp).

I manage to substitute the content out just fine, but it's when I'm trying to actually OUTPUT the content that's between the HTML tags that it goes wrong.

If that's confusing, I truly apologize. I've been at this project a couple of hours now and am getting a bit frustrated. Basically, why does this not work?

s/<b>.*<\/b>/.*/g

I simply want to output the content WITHOUT the bold tags.

Thanks a bunch!

Upvotes: 0

Answers (3)

forivall

Reputation: 9913

You need to use a capturing group, which is written with parentheses: ()

So, it's just this:

s/<b>(.*)<\/b>/\1/g

Capturing groups are numbered, from left to right, starting with one, and increasing.

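This is the syntax most regex flavors use; sed's default syntax is slightly different, because the parentheses have to be escaped. The sed command is either of the following (the second uses -r, GNU sed's switch for extended regular expressions, so the parentheses stay unescaped):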

sed 's/<b>\(.*\)<\/b>/\1/g' [file]

sed -r 's/<b>(.*)<\/b>/\1/g' [file]
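As a quick check, the sketch below runs the capturing-group substitution on a sample line. The sample text and variable names are made up for illustration, and it assumes a POSIX shell with a sed that supports basic regular expressions. It also shows one caveat: `.*` is greedy, so with two bold tags on one line it matches from the first `<b>` to the last `</b>`. A negated class such as `[^<]*` avoids that, as long as the bold text itself contains no `<`.

```shell
#!/bin/sh
# Hypothetical sample input; not taken from the original assignment file.
line='<b>first</b> and <b>second</b>'

# Greedy .* spans from the first <b> to the last </b>, so only the
# outermost pair of tags is removed.
greedy=$(printf '%s\n' "$line" | sed 's/<b>\(.*\)<\/b>/\1/g')

# [^<]* cannot cross a '<', so each tag pair is handled separately.
per_tag=$(printf '%s\n' "$line" | sed 's/<b>\([^<]*\)<\/b>/\1/g')

printf '%s\n' "$greedy"    # first</b> and <b>second
printf '%s\n' "$per_tag"   # first and second
```

If every bold tag in the file sits on its own line, the `.*` form is enough.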

Of course, if you just want to remove the tags, the other solution is to replace every HTML tag with nothing, like so

sed 's/<\("[^"]*"\|[^>"]\)*>//g' [file]

(I dislike sed's need to escape everything. Without the escaping, the same expression is:)

s/<("[^"]*"|[^>"])*>//g
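To see the tag-stripping idea in action, here is a minimal sketch. The sample HTML and variable names are invented for illustration, and it assumes a sed that accepts `-E` for extended regular expressions (both GNU and BSD sed do). The pattern treats a tag as `<`, followed by any mix of double-quoted strings and characters that are neither `>` nor `"`, followed by `>`, so a `>` inside an attribute value does not end the tag early.

```shell
#!/bin/sh
# Hypothetical sample input with an attribute value that contains '>'.
html='<a title="x > y" href="#">link</a> and <b>bold</b> text'

# -E selects extended regular expressions, so groups and alternation
# need no backslashes.
stripped=$(printf '%s\n' "$html" | sed -E 's/<("[^"]*"|[^>"])*>//g')

printf '%s\n' "$stripped"   # link and bold text
```

This is only a line-by-line approximation: tags that span several lines, single-quoted attributes, and comments are not handled, which is why a real HTML parser is usually preferred outside of exercises like this one.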

Upvotes: 1

Andrew Clark

Reputation: 208665

If you want to reference a part of your regex match in the replacement, you need to place that portion of the regex into a capturing group, and then refer to it using the group number preceded by a backslash. Try the following:

s/<b>\(.*\)<\/b>/\1/g
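A small sketch of how group numbers map to the replacement, using a made-up sample line with two groups. It assumes a POSIX shell and sed. It also uses `|` instead of `/` as the delimiter of the `s` command, which sed allows, so the `/` in the closing tags needs no escaping.

```shell
#!/bin/sh
# Hypothetical sample line; the tags are only for illustration.
line='<b>bold</b><i>italic</i>'

# \1 is the text captured inside <b>...</b>, \2 the text inside <i>...</i>.
# The replacement emits them in swapped order, separated by a space.
swapped=$(printf '%s\n' "$line" | sed 's|<b>\(.*\)</b><i>\(.*\)</i>|\2 \1|')

printf '%s\n' "$swapped"   # italic bold
```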

Upvotes: 1

c-smile

Reputation: 27470

I think this question is best answered by sed's documentation, for example: http://www.grymoire.com/Unix/Sed.html#uh-4

Upvotes: -1
