Remove duplicate XML tags with Awk?

Question

I have a very long custom XML file with some issues that I need to fix before I can process it in a Bash script.

The custom XML looks like this:


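<a:t>The cat</a:t>
<a:t> is</a:t>
<a:t> very</a:t>
<a:t> cute</a:t>

<a:t>the </a:t>
<a:t>dog </a:t>
<a:t>is </a:t>
<a:t>also </a:t>
<a:t>very </a:t>

<a:t>cute and nice</a:t>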

This is what I am trying to get:


<a:t>The cat is very cute</a:t>

<a:t>the dog is also very cute and nice</a:t>

A sentence should start with <a:t> and end with </a:t>.
The sentences are separated by some other tags (I do not have the list of all the possible tags).
I don't know how many words there are in each sentence.

I tried using a loop with grep on the first <a:t> and then removing the additional tags with sed, but it is clearly not going to work.

Is it possible to do this (probably with Awk)?

Thank you in advance,

tripleee · Accepted Answer

If your XML is literally this regular, it should be easy. The problem with using regular expressions and line-oriented tools on XML is that the XML syntax permits a lot of variations in line breaks and whitespace; but if your input doesn't have that, something like the following should work.

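awk '/^<a:t>/ {
    # Strip the tags and surrounding spaces, then append the text to the sentence
    sub(/^<a:t> */, ""); sub(/ *<\/a:t>/, "");
    sent = (sent ? sent " " : "") $0
    next }
# Any other line ends the sentence: print it, re-wrapped, before that line
sent { print "<a:t>" sent "</a:t>"; sent="" }
1
END { if(sent) print "<a:t>" sent "</a:t>" }' file.xml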

We collect the current sentence into the string variable sent, then print it out when we see a line that does not start with the sentence tag, or when we reach the end of the input file.

Repeating the print in the END block is unattractive, but I'm too lazy to go back and refactor.

Demo: https://ideone.com/B8SHOG
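
One possible way to avoid repeating the print in the END block is to move it into a user-defined Awk function. The following is a minimal sketch, not the answer's original code. The function name emit, the shell wrapper merge_sentences, the file name sample.xml and the <sep/> placeholder (standing in for the unknown separator tags) are all assumptions made for illustration.

```shell
# Hypothetical sample input; <sep/> stands in for the unknown separator tags.
cat > sample.xml <<'EOF'
<sep/>
<a:t>The cat</a:t>
<a:t> is</a:t>
<a:t> very</a:t>
<a:t> cute</a:t>
<sep/>
<a:t>the </a:t>
<a:t>dog </a:t>
<a:t>is </a:t>
<a:t>also </a:t>
<a:t>very </a:t>
<a:t>cute and nice</a:t>
EOF

# Hypothetical wrapper so the Awk program can be reused on any file.
merge_sentences() {
    awk '
    # emit() prints the pending sentence, if any, and resets it.
    function emit() {
        if (sent != "") print "<a:t>" sent "</a:t>"
        sent = ""
    }
    # Sentence fragment: strip the tags and padding, append to the sentence.
    /^<a:t>/ {
        sub(/^<a:t> */, ""); sub(/ *<\/a:t>/, "")
        sent = (sent == "" ? "" : sent " ") $0
        next
    }
    # Any other line ends the sentence and is passed through unchanged.
    { emit(); print }
    END { emit() }' "$@"
}

merge_sentences sample.xml
```

Comparing sent against the empty string, rather than testing its truth value, also keeps a sentence that happens to consist of just "0" from being dropped. Like the original, this sketch assumes every <a:t> element sits alone on its own line.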
