Reputation: 463
I have a large XHTML file with a lot of junk text that I don't need. I only need the text that lies between two specific strings, which occur many times within the file, e.g.:
<html>
<xyz> unneeded text </xyz>
<mytag> important text1 </mytag>
<xyz> unneeded text </xyz>
<xyz> unneeded text </xyz>
<mytag> important text2 </mytag>
<mytag> important text3 </mytag>
<xyz> unneeded text </xyz>
</html>
My output should be:
important text1
important text2
important text3
I need to do this in a Bash script.
Thanks for your help
Upvotes: 0
Views: 56
Reputation: 195039
Using regex on XML is risky, particularly with a line-based text-processing tool such as grep: you cannot be sure that the result is always correct.
If your input is valid XML, I would go the XML way and use an XPath expression.
With the tool xmlstarlet, you can do:
xmlstarlet sel -t -v "//mytag/text()" file.xml
It gives the desired output.
You can also do it with xmllint; however, you need to do some further filtering on the output.
Upvotes: 2
Reputation: 8769
Using an XML parser is the better approach, and there are command-line tools for XML parsing on Linux, e.g. xmllint, but you can also do it with grep like this:
$ cat data1
<html>
<xyz> unneeded text </xyz>
<mytag> important text1 </mytag>
<xyz> unneeded text </xyz>
<xyz> unneeded text </xyz>
<mytag> important text2 </mytag>
<mytag> important text3 </mytag>
<xyz> unneeded text </xyz>
</html>
$ grep -oP '(?<=<mytag>).*(?=</mytag>)' data1
important text1
important text2
important text3
$
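The pattern (?<=<mytag>).*(?=</mytag>) extracts the text using a positive lookbehind and a positive lookahead assertion. Note that the match keeps any spaces around the text, and that the greedy .* would over-match if a single line contained more than one <mytag> element.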
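To illustrate how the quantifier affects this kind of lookaround pattern, here is a small sketch comparing a greedy `.*` with a lazy `.*?`. The one-line sample with two elements is made up for the demonstration, and it assumes a GNU grep built with PCRE support (`-P`).

```shell
#!/usr/bin/env bash
# Hypothetical sample: two <mytag> elements share a single line.
line='<mytag>one</mytag><mytag>two</mytag>'

# Greedy .* runs up to the LAST </mytag> on the line, giving one over-long match.
greedy=$(printf '%s\n' "$line" | grep -oP '(?<=<mytag>).*(?=</mytag>)')

# Lazy .*? stops at the FIRST </mytag>, so each element is matched separately.
lazy=$(printf '%s\n' "$line" | grep -oP '(?<=<mytag>).*?(?=</mytag>)')

printf 'greedy: %s\n' "$greedy"
printf 'lazy:\n%s\n' "$lazy"
```

If your file always has one element per line, as in the question, both variants give the same result.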
Upvotes: 0
Reputation: 41987
Using an XML parser would be the best way to go.
A solution using grep with PCRE:
grep -Po '^<mytag>\s*\K.*?(?=\s*</mytag>$)'
Example:
$ cat file.xml
<html>
<xyz> unneeded text </xyz>
<mytag> important text1 </mytag>
<xyz> unneeded text </xyz>
<xyz> unneeded text </xyz>
<mytag> important text2 </mytag>
<mytag> important text3 </mytag>
<xyz> unneeded text </xyz>
</html>
$ grep -Po '^<mytag>\s*\K.*?(?=\s*</mytag>$)' file.xml
important text1
important text2
important text3
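Since the question asks for a Bash script, here is a minimal sketch that needs no external tools, only shell parameter expansion. The function name `extract_mytag` and the temporary sample file are my own choices for illustration, and it assumes each `<mytag>` element opens and closes on the same line, with at most one element per line. It is still not a real XML parser.

```shell
#!/usr/bin/env bash
# extract_mytag FILE: hypothetical helper that prints the trimmed text of
# each <mytag>...</mytag> found on a single line of FILE.
# The body runs in a subshell ( ... ) so its variables do not leak out.
extract_mytag() (
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in
      *'<mytag>'*'</mytag>'*)
        text=${line#*'<mytag>'}       # drop everything up to the opening tag
        text=${text%%'</mytag>'*}     # drop the closing tag and what follows
        text=${text#"${text%%[![:space:]]*}"}   # trim leading whitespace
        text=${text%"${text##*[![:space:]]}"}   # trim trailing whitespace
        printf '%s\n' "$text"
        ;;
    esac
  done < "$1"
)

# Sample input copied from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<html>
<xyz> unneeded text </xyz>
<mytag> important text1 </mytag>
<xyz> unneeded text </xyz>
<xyz> unneeded text </xyz>
<mytag> important text2 </mytag>
<mytag> important text3 </mytag>
<xyz> unneeded text </xyz>
</html>
EOF

result=$(extract_mytag "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

This only uses parameter expansion and `case`, so it does not depend on grep being built with PCRE.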
Upvotes: 0