user3809938

Reputation: 1304

Search for substring in string using Bash?

How can I extract the currency1 field in the following string:

<fxQuotation><currency1>USD</currency1><currency2>AUD</currency2>

The result should be USD.

The following command works:

echo "<fxQuotation><currency1>USD</currency1><currency2>AUD</currency2>" | cut -d">" -f3 | cut -d"<" -f1

However, if that string were a substring of a very large XML file, my command would not work. How can I search based on the currency1 field?

Upvotes: 1

Views: 151

Answers (3)

Casimir et Hippolyte

Reputation: 89639

Very easy using xidel:

xidel file.xml --extract "//currency1" -q

or

xidel file.xml --xpath "//currency1" -q

Both work even with badly formed XML, HTML, or XML mixed with text.

Upvotes: 2

fotonix

Reputation: 186

You would be best off using a small custom program in C or Python. That said, awk and sed are long-established tools that may offer a simple solution in a shell script (see "Print XML element with AWK"). The important thing is to make sure your input is clean and well-formed.
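
As a rough illustration of the awk route, the sketch below splits each line on the opening or closing tag and prints what sits between them. The function name first_tag_text is made up for this example. It assumes the element has no attributes, starts and ends on the same line, and that only the first occurrence per line is wanted.

```shell
# Hypothetical helper: print the text of <TAG>...</TAG> for each line that has one.
# Assumes no attributes on the tag and that the element does not span lines.
first_tag_text() {
  # FS is a regex matching either <TAG> or </TAG>, so $2 is the text between them.
  # NF > 2 skips lines that do not contain both the opening and the closing tag.
  awk -v tag="$1" 'BEGIN { FS = "</?" tag ">" } NF > 2 { print $2 }' "$2"
}

# Usage: first_tag_text currency1 file.xml
```

Anything beyond these simple cases (attributes, nested or multi-line elements) is where a real parser becomes the safer choice.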

Upvotes: 1

riteshtch

Reputation: 8769

It's better to use an XML parser or an XML query language than regexes and bash commands.

For Java, see the DOM, SAX and StAX based XML parsers. DOM loads the whole document into memory as a tree, so lookups are fast but memory use is high. SAX and StAX are better suited to large files because they stream the document: SAX pushes events to handlers that you write, while StAX lets your code pull events one at a time.
The Woodstox library is a good, efficient and fairly configurable XML parser. More info:
https://www.javacodegeeks.com/2013/05/parsing-xml-using-dom-sax-and-stax-parser-in-java.html
http://www.studytrails.com/java/xml/woodstox/java-xml-stax-woodstox-basic-parsing.jsp

You can also query XML with an SQL-like syntax using XQuery; XPath is another language for selecting your data.

http://www.w3schools.com/xsl/xpath_intro.asp
http://www.w3schools.com/xsl/xquery_intro.asp
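
For a feel of what an XPath query looks like from the command line, here is a minimal sketch. It assumes xmllint (shipped with libxml2) is installed and that the file is well-formed XML; the function name xpath_text is made up for this example.

```shell
# Sketch assuming xmllint (from libxml2) is installed and FILE is well-formed XML.
# Prints the text of the first element named TAG anywhere in the document.
xpath_text() {
  # //TAG selects every element with that name; string() takes the first one's text.
  xmllint --xpath "string(//$1)" "$2"
}

# Usage: xpath_text currency1 file.xml
```

Unlike the line-based tools, this does not care how the element is laid out across lines, but it will fail on input that is not well-formed.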

If you still want to use shell tools, grep with the -o option prints only the part of each line that matches the regex, so you can match the desired tag along with its content and then strip the tags with cut, sed or any other tool:

$ cat file1
text text abcd
cxyz
xyz

</rootelement>
<abcd>
<xyz><fxQuotation><currency1>USD</currency1><currency2>AUD</currency2></fxQuotation></xyz>
</abcd>
</rootelement>
$ grep -Eo '<currency1>[^<]*</currency1>' file1
<currency1>USD</currency1>
$ grep -Eo '<currency1>[^<]*</currency1>' file1 | sed -E 's/<[^>]*>//g'
USD
$ grep -oP '(?<=<currency1>)[^<]*(?=</currency1>)' file1
USD
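
The grep and sed pipeline can be wrapped in a small function so the tag name becomes a parameter. This is only a sketch: extract_tag is a made-up name, and it assumes the tag carries no attributes and that its content contains no nested elements.

```shell
# Hypothetical helper: print the text of every <TAG>...</TAG> element, one per line.
# Assumes no attributes on the tag and no nested elements inside it.
extract_tag() {
  # grep -o keeps only the matched <TAG>text</TAG> pieces, one per output line;
  # sed then deletes anything that looks like a tag, leaving the text.
  grep -oE "<$1>[^<]*</$1>" "$2" | sed -E 's/<[^>]*>//g'
}

# Usage: extract_tag currency1 file1
```

Because grep -o reports every match, a line holding several currency1 elements yields one output line per element.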

Upvotes: 1
