Reputation: 520
I have the following string:
<td class="mytest" title="testfile" style="width:20%">0</td>
How do I get the value inside the td element using awk? In my case, it is 0.
I am very new to Linux, any help is appreciated!
Upvotes: 0
Views: 298
Reputation: 203219
If your input is always that regular and you don't have and can't install XML-aware tools then using any sed in any shell on every Unix box:
$ sed 's:<td.*>\(.*\)</td>:\1:' file
0
I'm using sed instead of awk because simple substitutions on individual lines like this are what sed is best suited for. With GNU awk you could do this with the 3rd arg to match():
$ awk 'match($0,"<td.*>(.*)</td>",a){print a[1]}' file
0
but with a POSIX awk it'd be a bit more cryptic (there are alternative approaches of course):
$ awk 'sub("</td>","") && sub("<td.*>","")' file
0
Think about what the above is doing and test it to make sure you don't get any false matches. It's always much easier to match what you want than it is to not match similar strings you don't want.
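To make the false-match warning concrete, here is a minimal sketch in Python rather than sed/awk, assuming a hypothetical input line that holds two cells instead of one. It applies the Python equivalent of the greedy pattern above and compares it with a stricter pattern built from negated character classes (my own variation, not part of the commands above):

```python
import re

# Hypothetical input: two cells on one line, unlike the single-cell line in the question.
line = '<td class="a">1</td><td class="b">2</td>'

# Same greedy pattern as the sed command: the first .* stretches to the last '>'
# that still lets the rest match, so everything before the final cell is swallowed.
greedy = re.sub(r'<td.*>(.*)</td>', r'\1', line)

# Stricter pattern: [^>]* cannot leave the opening tag, [^<]* cannot leave the cell.
strict = re.findall(r'<td[^>]*>([^<]*)</td>', line)

print(greedy)  # only the last cell survives
print(strict)  # one entry per cell
```

The stricter pattern still assumes cells contain no nested tags, so it is only a partial safeguard.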
Upvotes: 1
Reputation: 36370
If you are allowed to select your tool I would suggest using hxselect (from html-xml-utils), then if you have file.txt holding
<td class="mytest" title="testfile" style="width:20%">0</td>
it would be as simple as
cat file.txt | hxselect -i -c td
output
0
Explanation: -i matches case-insensitively, -c prints content only, and td is a CSS selector. Disclaimer: there is no newline after 0 because there is no newline inside the tag.
However, if you are restricted to what is already installed and the Linux machine you are using has python installed (which, if I am not mistaken, recent Ubuntu versions do by default), you might harness html.parser as follows. Create a tdextract.py file with the following content:
import sys
from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    def __init__(self):
        self.insidetd = False
        super().__init__()
    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.insidetd = True
    def handle_endtag(self, tag):
        if tag == "td":
            self.insidetd = False
    def handle_data(self, data):
        if self.insidetd:
            sys.stdout.write(data)
parser = MyHTMLParser()
parser.feed(sys.stdin.read())
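then run
cat file.txt | python tdextract.py
which will give the same output as hxselect described earlier. The script needs Python 3 (it calls super() without arguments), so use python3 if the python command is missing or points to Python 2. Be warned that python uses indentation to mark blocks, so it is crucially important to keep the number of leading spaces.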
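If the values are needed inside a program rather than printed, the same html.parser idea can be adapted to collect them. The following is only a sketch under that assumption: the names TdCollector and td_values are hypothetical, and it gathers the text of each td into a list instead of writing to stdout.

```python
from html.parser import HTMLParser


class TdCollector(HTMLParser):  # hypothetical name, not part of the standard library
    def __init__(self):
        super().__init__()
        self.inside_td = False
        self.cells = []  # one string per <td> encountered

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so "TD" also arrives here as "td".
        if tag == "td":
            self.inside_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.inside_td = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self.inside_td:
            self.cells[-1] += data


def td_values(html):
    """Return the text content of every td element in the given HTML string."""
    parser = TdCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


print(td_values('<td class="mytest" title="testfile" style="width:20%">0</td>'))
```

Like the script above, this does not handle nested tables specially; text from an inner td would be appended to whichever cell was opened last.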
Upvotes: 2
Reputation: 117258
One option could be to use xmllint (xmllint --html) with an XPath expression to extract the value.
Example:
#!/bin/bash
data='<td class="mytest" title="testfile" style="width:20%">0</td>'
value=$(xmllint --html --xpath '//html/body/td/text()' - <<< "$data")
echo "$value"
Output:
0
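For comparison, a similar XML-aware extraction can be sketched with Python's standard library, assuming the fragment is well-formed XML on its own (true for the string in the question, but not for HTML in general). This is an alternative illustration, not what xmllint does internally:

```python
import xml.etree.ElementTree as ET

data = '<td class="mytest" title="testfile" style="width:20%">0</td>'

# fromstring() parses the fragment and returns its root element, here the td itself.
# It raises ParseError on input that is not well-formed XML (assumption: ours is).
cell = ET.fromstring(data)

value = cell.text  # text directly inside the element
print(value)
```

Attributes are available the same way, for example cell.get("title") returns the title attribute.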
Upvotes: 2