Reputation: 837

Using sed to extract HTML data

If I have a line of HTML

<td><em>data</em></td>

How can I print to stdout

<em>data</em>

if the line begins and ends with exactly the

<td>

and

</td>

tags? If the line does not both start and end with the td tags, do not print the line.

I tried

sed 's/<td>\(*\)</td>/\1/'

but it doesn't exactly work.
Thanks in advance.

Upvotes: 1

Answers (4)

Jotne

Reputation: 41456

This should do:

echo "<td><em>data</em></td>" | awk '{gsub(/<\/?td>/,x)}8'
<em>data</em>

Or this:

echo "<td><em>data</em></td>" | sed 's|</*td>||g'
<em>data</em>

Or this (more exact, since \? matches at most one slash, whereas * matches any number of them):

echo "<td><em>data</em></td>" | sed 's|</\?td>||g'
<em>data</em>
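
A small sketch of the difference between the two patterns, using a made-up input that contains a malformed <//td> tag. It assumes a sed that supports -E (extended regular expressions, where the optional slash is written /? instead of the GNU-only \?):

```shell
# Hypothetical sample: a valid pair of td tags plus a malformed <//td>.
sample='<td>a<//td></td>'

# * repeats the slash zero or more times, so the malformed <//td> is removed too.
star=$(printf '%s\n' "$sample" | sed 's|</*td>||g')

# ? allows zero or one slash, so the malformed <//td> is left in place.
opt=$(printf '%s\n' "$sample" | sed -E 's|</?td>||g')

printf 'with *: %s\n' "$star"
printf 'with ?: %s\n' "$opt"
```

For well-formed HTML both variants give the same result; the stricter form only matters when the input may contain stray slashes.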

To go through what is wrong with your attempt, sed 's/<td>\(*\)</td>/\1/': you are nearly there, but \(*\) does not work, since the * has nothing to repeat.
Adding a . makes it work, since . matches any character. So it should be \(.*\)
The closing td tag contains a forward slash /. Since you are using / as the separator, it must be
escaped as \/, giving <\/td>, so this works:

echo "<td><em>data</em></td>" | sed 's/<td>\(.*\)<\/td>/\1/g'
<em>data</em>

It could be changed to the following, which uses | as the separator so the slash needs no escaping:

echo "<td><em>data</em></td>" | sed 's|<td>\(.*\)</td>|\1|g'
<em>data</em>

But as you see in my examples above, there is no need to use a back reference. It is better to just
remove what you do not need.

If "the beginning and end of the line have exactly" means that nothing else comes before or after the tags on the line, anchor the pattern with ^ and $.
With a back reference:

sed 's|^<td>\(.*\)</td>$|\1|g'

just delete:

sed 's:^<td>\|</td>$::g'

and awk:

echo "<td><em>data</em></td>" | awk '{gsub(/^<td>|<\/td>$/,x)}8'
<em>data</em>
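
Note that sed prints every line by default, so the anchored substitutions above pass non-matching lines through unchanged. A minimal sketch of suppressing them, as the question asks, using -n together with the p flag (the sample input and the function name extract_td are made up for illustration):

```shell
# Hypothetical sample: only the first line both starts with <td> and ends with </td>.
input='<td><em>data</em></td>
<tr><td>x</td></tr>
plain text'

extract_td() {
  # -n turns off automatic printing; the p flag prints a line only
  # when the substitution on it succeeded.
  sed -n 's|^<td>\(.*\)</td>$|\1|p'
}

result=$(printf '%s\n' "$input" | extract_td)
printf '%s\n' "$result"
```

Only the first sample line is printed, with its outer td tags removed.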

Upvotes: 3

NeronLeVelu

Reputation: 10039

sed -n '\|^[[:blank:]]*<[tT][dD]>\(.*\)</[tT][dD]>[[:blank:]]*$| s//\1/p' YourFile

This takes only lines that start and end with the td tag (in either letter case, allowing surrounding whitespace) and prints the content (use --posix with GNU sed).
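
A quick sketch of how this command behaves on a few made-up lines. The address \|...| selects the matching lines, and the empty pattern in s//\1/ reuses that same regular expression:

```shell
# Hypothetical sample lines:
#   1. mixed-case tags with leading spaces and a trailing tab (should match)
#   2. an opening tag with no closing tag (should not match)
#   3. text before the opening tag (should not match)
result=$(printf '  <TD><em>data</em></td>\t\n<td>unterminated\ntext <td>inner</td>\n' |
  sed -n '\|^[[:blank:]]*<[tT][dD]>\(.*\)</[tT][dD]>[[:blank:]]*$| s//\1/p')

printf '%s\n' "$result"
```

Only the first line is printed, with the tags and the surrounding whitespace removed.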

Upvotes: 1

csiu

Reputation: 3269

Do you accept awk?

awk '/<td>/ { found=1; next }; /<\/td>/ { found=0; next }; found {print}' INFILE.txt

where INFILE.txt is the input file
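This command prints the lines between a line containing <td> and a line containing </td>, so it handles tags that are spread over multiple lines ;) Note that it prints nothing for a line that holds both tags, because the first rule skips that line.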
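
A minimal sketch of the multi-line case, feeding made-up input on stdin instead of reading INFILE.txt:

```shell
# Hypothetical input: the td tags sit on their own lines around the data.
result=$(printf '<td>\n<em>data</em>\n</td>\noutside\n' |
  awk '/<td>/ { found=1; next }      # opening tag line: start printing after it
       /<\/td>/ { found=0; next }    # closing tag line: stop printing
       found { print }')             # print lines while inside the td block

printf '%s\n' "$result"
```

The line "outside" is not printed because found has been reset by the closing tag.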

Upvotes: 1

slayedbylucifer

Reputation: 23502

$ sed -r 's:<td>(.*)<\/td>:\1:g' <<< '<td><em>data</em></td>'
<em>data</em>

If your requirement is as simple as you have described in your question, then sed is fine to use. However, if you want to parse HTML tags in general, then consider using perl, as sed would be far too inefficient for that. Use the right tool for the job.

Upvotes: 1
