Reputation: 1304
How can I extract the currency1 field in the following string:
<fxQuotation><currency1>USD</currency1><currency2>AUD</currency2>
The result should be USD.
The below command would work:
echo "<fxQuotation><currency1>USD</currency1><currency2>AUD</currency2>" | cut -d">" -f3 | cut -d"<" -f1
However, if that string were a substring of a very large XML file, my command would not work. How can I search based on the currency1 field instead?
Upvotes: 1
Views: 151
Reputation: 89639
Very easy using xidel:
xidel file.xml --extract "//currency1" -q
or
xidel file.xml --xpath "//currency1" -q
Both forms also work on badly formatted XML, on HTML, and on XML mixed with plain text.
Upvotes: 2
Reputation: 186
You would be best off writing a small custom program in C or Python. That said, awk and sed are long-established tools that may offer a simple solution in a shell script (see "Print XML element with AWK"). The main requirement is that your input is clean and well-formed.
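As a rough illustration of the "small custom program" route, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The function name extract_currency1 and the sample wrapper elements are made up for this example, and it assumes the input is well-formed XML.

```python
import xml.etree.ElementTree as ET


def extract_currency1(xml_text):
    """Return the text of every <currency1> element in a well-formed XML string."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so the nesting depth of the tag does not matter.
    return [elem.text for elem in root.iter("currency1")]


# Hypothetical sample: the fragment from the question wrapped in a root element.
sample = (
    "<rootelement><abcd><xyz><fxQuotation>"
    "<currency1>USD</currency1><currency2>AUD</currency2>"
    "</fxQuotation></xyz></abcd></rootelement>"
)
print(extract_currency1(sample))
```

If the input is not well-formed, ET.fromstring raises xml.etree.ElementTree.ParseError, which is why clean input matters for this approach.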
Upvotes: 1
Reputation: 8769
It's better to use an XML parser or an XML query language than regular expressions and bash commands.
For Java, see the DOM, SAX, and StAX based XML parsers. DOM loads the whole document into memory as a tree, so lookups are fast but memory use is high. SAX and StAX cope much better with large files because they process the XML as a stream of events: SAX pushes events to handlers that you write, while StAX lets your code pull events as it needs them.
The Woodstox library is a good, efficient, and fairly configurable XML parser. More info:
https://www.javacodegeeks.com/2013/05/parsing-xml-using-dom-sax-and-stax-parser-in-java.html
http://www.studytrails.com/java/xml/woodstox/java-xml-stax-woodstox-basic-parsing.jsp
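The streaming idea is not specific to Java. As a sketch only (this is Python's standard library, not the SAX, StAX, or Woodstox API), xml.etree.ElementTree.iterparse reads a large file incrementally and reports each element as it is completed. The function name stream_currency1 is hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def stream_currency1(source):
    """Yield the text of each <currency1> element from a file path or file object."""
    # An "end" event is reported once an element has been read completely.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "currency1":
            yield elem.text
        # Discard the element's content once handled to keep memory use low.
        elem.clear()


# Usage with an in-memory stand-in for a large file (hypothetical data).
data = io.BytesIO(
    b"<root><fxQuotation><currency1>USD</currency1>"
    b"<currency2>AUD</currency2></fxQuotation></root>"
)
for value in stream_currency1(data):
    print(value)
```

Passing a file name such as "file.xml" instead of the BytesIO object should work the same way, since iterparse accepts either.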
You can also query XML with an SQL-like syntax using XQuery; XPath is another language for selecting the data you need.
http://www.w3schools.com/xsl/xpath_intro.asp
http://www.w3schools.com/xsl/xquery_intro.asp
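To give a feel for XPath, here is a small Python sketch. ElementTree supports only a limited subset of XPath, so this is an illustration rather than a full XPath engine, and the second quotation (EUR/GBP) is invented sample data.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with two quotations; only the first is from the question.
doc = ET.fromstring(
    "<quotes>"
    "<fxQuotation><currency1>USD</currency1><currency2>AUD</currency2></fxQuotation>"
    "<fxQuotation><currency1>EUR</currency1><currency2>GBP</currency2></fxQuotation>"
    "</quotes>"
)

# ".//currency1" matches the element at any depth below the root.
all_currency1 = [e.text for e in doc.findall(".//currency1")]

# A predicate narrows the match: currency1 only where the sibling currency2 is AUD.
against_aud = [
    e.text for e in doc.findall(".//fxQuotation[currency2='AUD']/currency1")
]

print(all_currency1)
print(against_aud)
```

The predicate form is what a plain grep cannot easily express: it selects by the structure of the document rather than by what happens to be on one line.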
But if you still insist on using bash tools, just grep your string with the -o option to get the desired tag along with its content (-o prints only the parts of each line that match the regex), and then remove the tags using cut, sed, or any other tool:
$ cat file1
text text abcd
cxyz
xyz
</rootelement>
<abcd>
<xyz><fxQuotation><currency1>USD</currency1><currency2>AUD</currency2></fxQuotation></xyz>
</abcd>
</rootelement>
$ egrep -o '<currency1>[^<]*</currency1>' file1
<currency1>USD</currency1>
$ egrep -o '<currency1>[^<]*</currency1>' file1 | sed -r 's/<[^>]*>//g'
USD
$ grep -oP '(?<=<currency1>)[^<]*(?=</currency1>)' file1
USD
$
Upvotes: 1