Reputation: 837
If I have a line of HTML
<td><em>data</em></td>
How can I print to stdout
<em>data</em>
if the line begins with exactly <td> and ends with exactly </td>? If the line does not start or end with td tags, do not print the line.
I tried
sed 's/<td>\(*\)</td>/\1/'
but it doesn't exactly work.
Thanks in advance.
Upvotes: 1
Views: 307
Reputation: 41456
This should do:
echo "<td><em>data</em></td>" | awk '{gsub(/<\/?td>/,x)}8'
<em>data</em>
Or this:
echo "<td><em>data</em></td>" | sed 's|</*td>||g'
<em>data</em>
Or this (more exact, since \? matches zero or one occurrence of the preceding character, while * matches any number of them):
echo "<td><em>data</em></td>" | sed 's|</\?td>||g'
<em>data</em>
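To see the difference between the two sed patterns, here is a small sketch. It assumes GNU sed, since \? is a GNU extension in basic regular expressions, and the function names are made up for illustration. With *, any number of slashes is accepted, so a malformed tag such as <//td> is removed too; with \?, at most one slash matches.

```shell
# Hypothetical helpers; the names are for illustration only.
strip_star()  { sed 's|</*td>||g'; }    # * = zero or more slashes
strip_qmark() { sed 's|</\?td>||g'; }   # \? = zero or one slash (GNU sed)

echo '<td>a<//td>' | strip_star     # prints: a
echo '<td>a<//td>' | strip_qmark    # prints: a<//td>
```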
To go through what is wrong with your attempt, sed 's/<td>\(*\)</td>/\1/':
You are nearly there, but \(*\) does not work, since the * has nothing before it to repeat. Adding a simple . makes it work, since . matches any character. So it should be \(.*\)
The second td tag contains a forward slash /. Since you are using / as the separator, it must be escaped as \/, giving <\/td>
So this works:
echo "<td><em>data</em></td>" | sed 's/<td>\(.*\)<\/td>/\1/g'
<em>data</em>
It could be changed to use | as the separator, so the / no longer needs escaping:
echo "<td><em>data</em></td>" | sed 's|<td>\(.*\)</td>|\1|g'
<em>data</em>
But as you can see in my examples above, there is no need to use a back reference. It is better to just remove what you do not need.
If "the beginning and end of the line have exactly" means that the line has nothing else before <td> or after </td>, anchor the pattern with ^ and $.
With a back reference:
sed 's|^<td>\(.*\)</td>$|\1|g'
Or just delete the tags (\| is a GNU sed extension):
sed 's:^<td>\|</td>$::g'
And awk:
echo "<td><em>data</em></td>" | awk '{gsub(/^<td>|<\/td>$/,x)}8'
<em>data</em>
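Note that the substitutions above still print lines that do not match, unchanged. If such lines should be dropped, as the question asks, one option is sed's -n switch together with the p flag. A minimal sketch, where the function name td_content is hypothetical and the pattern assumes lowercase tags with nothing around them:

```shell
# Print only the content of lines that are exactly <td>...</td>.
# -n suppresses automatic printing; the p flag prints a line only
# when the substitution succeeded.
td_content() { sed -n 's|^<td>\(.*\)</td>$|\1|p'; }

printf '%s\n' '<td><em>data</em></td>' '<tr>skip me</tr>' 'x <td>y</td>' | td_content
# prints: <em>data</em>
```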
Upvotes: 3
Reputation: 10039
sed -n '\|^[[:blank:]]*<[tT][dD]>\(.*\)</[tT][dD]>[[:blank:]]*$| s//\1/p' YourFile
This takes only the lines that start and end with the td tag (in either case, optionally surrounded by blanks) and prints the content (use --posix with GNU sed).
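As a rough illustration of how that command behaves, here is the same sed expression reading from standard input instead of YourFile. The wrapper name td_only and the sample lines are assumptions made for the example.

```shell
# \|...| is an address that uses | as its delimiter; the empty pattern
# in s//\1/ reuses the last regular expression, i.e. the address itself.
td_only() {
  sed -n '\|^[[:blank:]]*<[tT][dD]>\(.*\)</[tT][dD]>[[:blank:]]*$| s//\1/p'
}

printf '%s\n' '  <TD><em>data</em></td>  ' 'plain text' '<td>x</td> tail' | td_only
# prints: <em>data</em>
```

The third sample line is dropped because something other than blanks follows the closing tag.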
Upvotes: 1
Reputation: 3269
Do you accept awk?
cat INFILE.txt | awk '/<td>/ { found=1; next }; /<\/td>/ { found=0; next }; found {print}'
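This prints the lines between <td> and </td>, even if the tags span multiple lines ;) Lines containing either tag are skipped, so it assumes each tag sits on its own line.
Upvotes: 1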
Reputation: 23502
$ sed -r 's:<td>(.*)<\/td>:\1:g' <<< '<td><em>data</em></td>'
<em>data</em>
If your requirement is as simple as you have mentioned in your question, then sed is fine to use. However, if you want to parse HTML tags, then consider using perl, as sed would be far too inefficient for that. Use the right tool for the job.
Upvotes: 1