Reputation: 11
First time sed'er, so be gentle.
I have the following text file, 'test_file':
<Tag1>not </Tag1><Tag2>working</Tag2>
I want to extract the text in between <Tag2>
using sed regex, there may be other occurrences of <Tag2>
and I would like to extract those also.
So far I have this sed based regex:
cat test_file | grep -i "Tag2"| sed 's/<[^>]*[>]//g'
which gives the output:
not working
Anyone any idea how to get this working?
Upvotes: 1
Views: 14735
Reputation: 342373
you can use gawk, eg
$ cat file
<Tag1>not </Tag1><Tag2>working here</Tag2>
<Tag1>not </Tag1><Tag2>
working
</Tag2>
$ awk -vRS="</Tag2>" '/<Tag2>/{gsub(/.*<Tag2>/,"");print}' file
working here
working
Upvotes: 0
Reputation: 139471
For your nice, friendly example, you could use
sed -e 's/^.*<Tag2>//' -e 's!</Tag2>.*!!' test-file
but the XML out there is cruel and uncaring. You're asking for serious trouble using regular expressions to scrape XML.
Upvotes: 4
Reputation: 20142
As another poster said, sed may not be the best tool for this job. You may want to use something built for XML parsing, or even a simple scripting language, such as perl.
The problem with your try, is that you aren't analyzing the string properly.
cat test_file
is good - it prints out the contents of the file to stdout.
grep -i "Tag2"
is ok - it prints out only lines with "Tag2" in them. This may not be exactly what you want. Bear in mind that it will print the whole line, not just the <Tag2> part, so you will still have to search out that part later.
sed 's/<[^>]*[>]//g'
isn't what you want - it simply removes the tags, including <Tag1> and <Tag2>.
You can try something like:
cat tmp.tmp | grep -i tag2 | sed 's/.*<Tag2>\(.*\)<\/Tag2>.*/\1/'
This will produce
working
but it will only work for one tag pair.
Upvotes: 4