Reputation: 607
What command should I use to extract the text inside the following HTML, which sits in a file "test.html" containing: <span id="imAnID">extractme</span> ?
The file will be larger so I need to point grep or sed to an id and then tell it to extract only the text from the tag having this ID. Assuming I run the terminal from the directory where the file resides, I am doing this:
cat test.html | sed -n 's/.*<span id="imAnID">\(.*\)<\/span>.*/\1/p'
What am I doing wrong? I get an empty output... Not opposed to using grep for this if it's easier.
Upvotes: 0
Views: 9825
Reputation: 3694
Using grep -o:
echo "<span id=\"imAnID\" hello>extractme</span> <span id='imAnID'>extractmetoo</span>" | grep -oE 'id=.?imAnID[^<>]*>[^<>]+' | cut -d'>' -f2
will find:
#=>extractme
#=>extractmetoo
This works as long as the span element carrying the desired id attribute comes immediately before the text to extract.
Upvotes: 0
Reputation: 2909
awk, sed and grep are line-oriented tools. XML and HTML are based on tags. The two don't combine that well, though you can get by with awk, sed and grep on XML and HTML by using a pretty formatter on the XML or HTML before resorting to your line-oriented tools.
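As a rough illustration of that idea, the sketch below uses tr as a crude stand-in for a pretty formatter: it joins the file into one line, then starts a new line at every <, so each tag and the text that follows it land on a single line that sed can match. The function name extract_by_id is made up for this example. It assumes the id contains no regex metacharacters, that the attribute is written with double quotes, and that the span holds plain text (output stops at the first nested tag).

```shell
# Hypothetical helper: print the text that follows <span ... id="ID" ...>.
# $1: id value (assumed free of regex metacharacters), $2: HTML file
extract_by_id() {
    # 1. Join all lines, so a span broken across lines becomes one line.
    # 2. Turn every "<" into a newline: each line now starts with a tag name.
    # 3. Keep only lines that begin with the wanted span, minus the tag itself.
    tr '\n' ' ' < "$2" | tr '<' '\n' | sed -n "s/^span[^>]* id=\"$1\"[^>]*>//p"
}

# Usage: extract_by_id imAnID test.html
```

Newlines inside the extracted text come out as spaces, since the first tr flattens the whole file.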
There's a program called xmlgawk that is supposed to be quite gawk-like, while still working on XML.
I personally prefer to do this sort of thing in Python using the lxml module, so that the XML/HTML can be fully understood without getting too wordy.
Upvotes: 0
Reputation: 10087
It is awkward to use awk, sed, or grep for this since these tools are line-based (one line at a time). Is it guaranteed that the span you are trying to extract is all on the same line? Is there any possibility of other tags used within the span (e.g. em tags)? If not, then this sounds like a job for perl.
Upvotes: 0
Reputation: 47357
You can try doing it with awk instead:
#!/bin/bash
start_tag="span id=\"imAnID\""
end_tag="/span"
# Split each line on < and >, so tag contents and text alternate as fields.
awk -F'[<>]' -v taga="$start_tag" -v tagb="$end_tag" '{
    for (i = 1; i <= NF - 2; i++)
        if ($i == taga && $(i+2) == tagb) print $(i+1)
}'
Use it like this:
$ ./script < infile > outfile
Upvotes: 0