Reputation: 165
I have an XML file similar to the following:
<?xml version="1.0" encoding="UTF-8"?>
<OnlineCommentary>
<doc docid="cnn_210085_comment002" articleURL="http://www.cnn.com/News.asp?NewsID=210085" date="10/07/2010" time="00:21" subtitle="Is Justin Bieber getting special treatment?" author="Zorro75">
<seg id="1"> They are the same thing. Let's shoot them both. </seg>
</doc>
<doc docid="cnn_210092_comment004" articleURL="http://www.cnn.com/News.asp?NewsID=210092" date="06/04/2010" time="17:07" subtitle="Dear Chicago, we love you despite it all" author="MRL1313">
<seg id="1"> We can't wait for you to move back either. </seg>
<seg id="2"> You seem quite uptight. </seg>
<seg id="3"> Does your wife (who is also your sister) not give it up any more? </seg>
</doc>
</OnlineCommentary>
I would like to run a command on this file that extracts only the content between the opening tag <seg ...> and the closing tag </seg>.
I tried:
sed -n 's:.*<seg id="1">\(.*\)</seg>.*:\1:p' XML-file.xml > output.txt
My questions are the following:
-- How can I print the content of all <seg id="*"> tags? My command prints only the content of the first tag (<seg id="1">).
-- Is there a way to print <seg id="1">, <seg id="2">, and <seg id="3"> from the same <doc> on one line, while a <doc> that contains only <seg id="1"> is printed on its own line?
Upvotes: 0
Views: 3272
Reputation: 10039
Print all the <seg id=...> elements (one per line), including the <seg> tags themselves:
sed -n 's:.*\(<seg id="[0-9]\{1,\}">.*</seg>\).*:\1:p' XML-file.xml > output.txt
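If only the text between the tags is wanted, as the question asks, the capture group can be moved inside the tags. The following is a minimal sketch, not a tested production command: it assumes each <seg> element sits on a single line as in the sample, and the sample data, the temporary file, and the variable name content_only are hypothetical.

```shell
# Hypothetical sample, reduced from the data in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<OnlineCommentary>
<doc docid="a">
<seg id="1"> They are the same thing. </seg>
</doc>
<doc docid="b">
<seg id="1"> First. </seg>
<seg id="2"> Second. </seg>
</doc>
</OnlineCommentary>
EOF

# Match any numeric id and capture only the text between the tags.
# The " *" on each side of the group drops the padding spaces.
content_only=$(sed -n 's:.*<seg id="[0-9]\{1,\}"> *\(.*[^ ]\) *</seg>.*:\1:p' "$sample")
printf '%s\n' "$content_only"

rm -f "$sample"
```

Each matching line yields one line of output holding just the segment text.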
Print everything on one line, separated by commas. Instead of printing each match, append it to the hold buffer; at the end, recall the buffer, replace each newline with a comma (and remove the leading comma caused by the append action), and print the result:
sed -n '\:.*\(<seg id="[0-9]\{1,\}">.*</seg>\).*: { s//\1/
H
}
$ {g
s/\n/,/g;s/^,//
p
}' XML-file.xml > output.txt
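The script above joins every <seg> in the file into a single line. To get one output line per <doc> instead, as the second question asks, the hold buffer can be flushed at each </doc>. This is a sketch under the same assumptions as before (one <seg> per line, closing </doc> on its own line); the sample data and the variable name per_doc are hypothetical, and it prints only the segment text rather than the tags.

```shell
# Hypothetical sample, reduced from the data in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<OnlineCommentary>
<doc docid="a">
<seg id="1"> They are the same thing. </seg>
</doc>
<doc docid="b">
<seg id="1"> First. </seg>
<seg id="2"> Second. </seg>
</doc>
</OnlineCommentary>
EOF

per_doc=$(sed -n '
# On a seg line: keep only the text and append it to the hold buffer.
/<seg id="[0-9]\{1,\}">/ {
    s:.*<seg id="[0-9]\{1,\}"> *\(.*[^ ]\) *</seg>.*:\1:
    H
}
# On a closing doc tag: fetch the buffer, join its lines with commas,
# print it, then leave an empty hold buffer for the next doc.
/<\/doc>/ {
    x
    s/^\n//
    s/\n/,/g
    p
    s/.*//
    x
}
' "$sample")
printf '%s\n' "$per_doc"

rm -f "$sample"
```

A <doc> with a single <seg> ends up alone on its line, while a <doc> with several segments gets them comma-separated on one line. A <doc> without any <seg> would produce an empty line with this sketch.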
That said, @Choroba's advice to use proper XML tools is very good: it minimizes the risk of processing unwanted data from the file.
Upvotes: 1