Reputation: 177
I have a very long custom XML file with some issues that I need to fix before I can process it in a Bash script.
The custom XML looks like this:
<SOME TAGS>
<a:t>The cat</a:t>
<a:t> is</a:t>
<a:t> very</a:t>
<a:t> cute</a:t>
<SOME OTHER TAGS>
<a:t>the </a:t>
<a:t>dog </a:t>
<a:t>is </a:t>
<a:t>also </a:t>
<a:t>very </a:t>
<a:t></a:t>
<a:t>cute and nice</a:t>
<ANOTHER TAG>
This is what I am trying to get:
<SOME TAGS>
<a:t>The cat is very cute</a:t>
<SOME OTHER TAGS>
<a:t>the dog is also very cute and nice</a:t>
<ANOTHER TAG>
I tried using a loop with grep on the first <a:t> and then removing the additional tags with sed, but it is clearly not going to work.
Is it possible to do this (probably with Awk)?
Thank you in advance,
Upvotes: 1
Views: 221
Reputation: 189628
If your XML is literally this regular, it should be easy. The problem with using regular expressions and line-oriented tools on XML is that the XML syntax permits a lot of variations in line breaks and whitespace; but if your input doesn't have that, something like the following should work.
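awk '/^<a:t>/ {
  # strip the tags and any padding spaces around the text
  sub(/^<a:t> */, ""); sub(/ *<\/a:t>/, "");
  # skip empty elements so they do not introduce a double space
  if ($0 != "") sent = (sent ? sent " " : "<a:t>") $0
  next }
sent { print sent "</a:t>"; sent="" }
1
END { if(sent) print sent "</a:t>" }' file.xml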
We collect the current sentence into the string variable sent, then print it out when we see a tag which is different from the sentence tag, or when we reach the end of the input file.
Repeating the print in the END block is unattractive, but I'm too lazy to go back and refactor.
Demo: https://ideone.com/B8SHOG
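The duplicated print in the END block can be avoided by moving the output step into an awk function. The following is only a sketch under the same assumption as above (each <a:t>...</a:t> element sits alone on its own line); the names merge_at and emit are made up for illustration and are not part of the original answer. It also skips empty <a:t></a:t> elements so that they do not add an extra space.

```shell
# Hypothetical wrapper: reads the named files, or stdin when none are given.
merge_at() {
  awk '
    # Print the collected sentence, if any, and reset the buffer.
    function emit() {
      if (sent != "") print "<a:t>" sent "</a:t>"
      sent = ""
    }
    /^<a:t>/ {
      sub(/^<a:t> */, "")     # drop the opening tag and leading spaces
      sub(/ *<\/a:t>/, "")    # drop the closing tag and trailing spaces
      if ($0 != "") sent = (sent == "" ? $0 : sent " " $0)
      next
    }
    { emit(); print }         # any other line ends the current sentence
    END { emit() }
  ' "$@"
}

# Usage on a small inline sample instead of a real file:
printf '%s\n' '<SOME TAGS>' '<a:t>The cat</a:t>' '<a:t> is</a:t>' '<ANOTHER TAG>' | merge_at
```

Because the buffer holds only the text, the tags are added in exactly one place, which makes it easier to change the output format later.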
Upvotes: 1
Reputation: 23677
With perl, assuming the <a:t>...</a:t> elements are always on their own line with no other text. Since the entire input is slurped, this is not a good solution for very large files.
$ perl -0777 -pe 's%(^<a:t>.*</a:t>\n)+%$&=~s#(?<!\A)<a:t>|</a:t>\n(?!\z)##rg%gme' ip.txt
<SOME TAGS>
<a:t>The cat is very cute</a:t>
<SOME OTHER TAGS>
<a:t>the dog is also very cute and nice</a:t>
<ANOTHER TAG>
-0777 slurps the entire input
(^<a:t>.*</a:t>\n)+ matches one or more lines starting with <a:t> and ending with </a:t> (. won't match a newline without the s flag)
(?<!\A)<a:t>|</a:t>\n(?!\z) matches <a:t> and </a:t> except at the start/end of the matched string
The e flag allows Perl code in the replacement section, used here to perform another substitution
The m flag allows ^ to match at the start of every line
Upvotes: 2