Reputation: 165
Problem Description:
Consider the below XML file:
<xmlhead1>
<xmlsubhead1>
<record>
<field>Hello</field>
<field>World</field>
</record>
<record>
<field>DELETEKEY</field>
<field>World1</field>
</record>
</xmlsubhead1>
</xmlhead1>
My objective is to remove a <record> element whenever one of its <field> children has the value DELETEKEY.
So in the XML file above, I want to remove:
<record>
<field>DELETEKEY</field>
<field>World1</field>
</record>
Solution Chosen:
I tried to solve the above problem with GNU sed. My code is below:
sed -n '
/<xmlhead1>/,/<\/xmlhead1>/{
  /<xmlsubhead1>/,/<\/xmlsubhead1>/{
    /<record>/,/<\/record>/{
      #Append to hold space
      H
      #if match DELETEKEY, start delete processing for the xml <record> element
      /<field>DELETEKEY<\/field>/{
        s/.*//g ; x
        b delete
      }
      #if you have reached the end tag of the <record> element,
      #print the hold space and clear the buffers
      /<\/record>/{
        g ; s/^\n//g; p
        s/.*//g ; x ; s/.*//g
      }
      #continue to next line
      b
      #delete processing
      :delete
      {
        #clear pattern space.
        s/.*//g
        #Read Next Line and remove new line(\n)
        N
        s/^\n//g
        #end delete processing when line matches the end tag </record>
        /<\/record>/b
        #else continue to get next line for delete process
        b delete
      }
    }
  }
}
#print all other lines
p
' "$inputfile"
The logic is as follows:
- Match the address range beginning with <xmlhead1> and ending with </xmlhead1>.
- Match the inner address range beginning with <xmlsubhead1> and ending with </xmlsubhead1>.
- Match the inner address range <record> to </record>.
- When inside the <record> element:
(i) Append every line to the hold space.
(ii) If the line matches DELETEKEY, this record has to be deleted: do steps (iii) and (iv). Otherwise, go to step (v).
(iii) For the delete, clear the hold space and jump to the delete branch.
(iv) In the delete branch, read the following lines with the N command until </record> is encountered. When </record> is encountered, exit the loop and start processing the next line.
(v) When not in the delete logic, if </record> is encountered, a complete <record> to </record> block has been processed and is present in the hold space.
(vi) Take the block out of the hold space and print it.
Output of the above logic:
<xmlhead1>
<xmlsubhead1>
<record>
<field>Hello</field>
<field>World</field>
</record>
</xmlhead1>
Problem in the output:
As you can see, the record element containing DELETEKEY is removed, but the </xmlsubhead1> tag is also missing from the output.
Issue Debugging:
While debugging, I expected that once the delete processing had read the </record> line inside the <record> to </record> range, the inner address range would have ended, since I had read and processed the </record> line.
But the <record> to </record> range block seems to process the </xmlsubhead1> line too.
I found this by adding the code below inside the command block of the <record> range:
/<record>/,/<\/record>/{
  /<\/xmlsubhead1>/{
    s/.*/xmlsubhead1 end tag is inside the record to record range/g
    p
  }
}
Can someone explain this behavior of sed, where a range match extends past its end address? In this case, the <record> to </record> range also matches the </xmlsubhead1> line.
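The same behavior can be reproduced with a much smaller script. This is only a minimal sketch, assuming GNU sed; the lines START, x, END and after are made-up stand-ins and are not part of the real file. The line that matches the end address is pulled in with N, and the range then still applies to the line that follows it:

```shell
# Minimal reproduction (GNU sed assumed; START/x/END/after are made-up lines).
# N pulls the END line into the pattern space and d starts the next cycle,
# so END is never the current line when the range address is evaluated.
out=$(printf 'START\nx\nEND\nafter\n' | sed -n '
/START/,/END/{
  /START/{
    N
    N
    d
  }
  s/^/in-range: /p
}')
printf '%s\n' "$out"
# prints: in-range: after
```

The line "after" lies outside START..END in the input, yet the commands inside the range block still run on it.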
Upvotes: 2
Views: 353
Reputation: 47229
I agree with the comments about using a proper XML parser.
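The issue with your sed script is that you read lines with N in your :delete "function". The </record> line is pulled in by N, so it is never the current line when the /<record>/,/<\/record>/ address is evaluated; the range therefore stays open and also applies to the </xmlsubhead1> line. Here is a working example that uses simpler logic (run it with sed -n, as in your script):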
/<xmlhead1>/,/<\/xmlhead1>/{
  /<xmlsubhead1>/,/<\/xmlsubhead1>/{
    /<record>/ {
      :a
      N
      /<\/record>/!ba
      /<field>DELETEKEY<\/field>/d
    }
  }
}
p
That is, when in the right context, read a full record (assuming a simplified XML structure) and, if the record contains the offending text, delete it.
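To check the script end to end, here is a sketch that runs it on the sample from the question. It assumes the input has one element per line (the layout implied by the question's output) and that sed is invoked with -n, since the script prints explicitly with p. The variable names input and result are only for illustration:

```shell
# Sample input from the question, one element per line (assumed layout).
input='<xmlhead1>
<xmlsubhead1>
<record>
<field>Hello</field>
<field>World</field>
</record>
<record>
<field>DELETEKEY</field>
<field>World1</field>
</record>
</xmlsubhead1>
</xmlhead1>'

# -n is needed because the script prints explicitly with p.
result=$(printf '%s\n' "$input" | sed -n '
/<xmlhead1>/,/<\/xmlhead1>/{
  /<xmlsubhead1>/,/<\/xmlsubhead1>/{
    /<record>/{
      # Collect the whole record into the pattern space.
      :a
      N
      /<\/record>/!ba
      # Drop the record if it contains the key; otherwise fall through to p.
      /<field>DELETEKEY<\/field>/d
    }
  }
}
p')
printf '%s\n' "$result"
```

The first record and both closing tags (</xmlsubhead1> and </xmlhead1>) should be printed, and the DELETEKEY record should be gone. The loop assumes every <record> has a matching </record> on a later line.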
Upvotes: 0