Reputation: 165

sed - regex address range is matching lines outside the range when N command is used

Problem Description:

Consider the below XML file:
<xmlhead1>
   <xmlsubhead1>
       <record>
           <field>Hello</field>
           <field>World</field>
       </record>
       <record>
           <field>DELETEKEY</field>
           <field>World1</field>
       </record>
   </xmlsubhead1>
</xmlhead1>
My objective is to remove a <record> element when one of its <field> child elements has DELETEKEY as its value.

So in the above XML file, I want to remove:
   <record>
       <field>DELETEKEY</field>
       <field>World1</field>
   </record>

Solution chosen: I tried to solve the above problem with GNU sed.
Below is my code.

sed -n '

/<xmlhead1>/,/<\/xmlhead1>/{
    /<xmlsubhead1>/,/<\/xmlsubhead1>/{
        /<record>/,/<\/record>/{

            #Append to hold space
            H

            #if match DELETEKEY, start delete processing for the xml <record> element
            /<field>DELETEKEY<\/field>/{
                s/.*//g ; x
                b delete
            }

            #if you have reached the end tag of the <record> element,
            #print the hold space and clear the buffers
            /<\/record>/{
                g ; s/^\n//g; p
                s/.*//g ; x ; s/.*//g
            }

            #continue to next line
            b

            #delete processing
            :delete
            {
                #clear pattern space.
                s/.*//g

                #Read Next Line and remove new line(\n)
                N
                s/^\n//g

                #end delete processing when line matches the end tag </record>
                /<\/record>/b

                #else continue to get next line for delete process
                b delete
            }
        }
    }
}

#print all other lines
p

' "$inputfile"

The logic is as below:

Match Address Range beginning with <xmlhead1> and ending with </xmlhead1>

Match Inner Address Range beginning with <xmlsubhead1> and ending with </xmlsubhead1>

Match Inner Address Range <record> to </record>

When inside a <record> element:
(i) Append every line to the hold space.
(ii) If the line matches DELETEKEY, this record has to be deleted: do steps iii and iv. Otherwise, go to step v.
(iii) For the delete, clear the hold space and jump to the delete branch.
(iv) In the delete branch, read the following lines with the 'N' command until </record> is encountered. When </record> is encountered, exit the loop and start processing the next line.
(v) When not in the delete logic, reaching </record> means a complete <record> to </record> block has been collected in the hold space.
(vi) So take it out of the hold space and print it.

Output of the above logic:

<xmlhead1>
   <xmlsubhead1>
       <record>
           <field>Hello</field>
           <field>World</field>
       </record>
</xmlhead1>

Problem in the output:
Notice that the record element with the DELETEKEY is removed, but the </xmlsubhead1> tag is also missing from the output.

Issue debugging:
While debugging, I expected the <record> to </record> range to end once the delete processing had read and processed the </record> line.

But the <record> to </record> range block seems to process the </xmlsubhead1> line too.

I found this by adding the code below inside the command block of the <record> range.

/<record>/,/<\/record>/{

    /<\/xmlsubhead1>/{
        s/.*/xmlsubhead1 end tag is inside the record to record range/g
        p
    }
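
The same symptom seems to show up in a much smaller case. The sketch below is only an illustration: the five input lines and the START/END markers are made up and stand in for <record> and </record>, and the "[in] " prefix simply marks lines that were processed inside the range block.

```shell
# Hypothetical input: START, END, after, END, tail (one per line).
out=$(printf '%s\n' START END after END tail | sed -n '
/START/,/END/{
    # On the opening line, N pulls the first END line into the pattern space.
    /START/N
    s/\n/+/
    # Tag everything that runs inside the range block.
    s/^/[in] /
    p
    b
}
# Lines outside the range are printed unchanged.
p
')
printf '%s\n' "$out"
```

This prints "[in] START+END", "[in] after", "[in] END" and then "tail": the line "after" is still handled inside the range even though an END line was already read by N, and the range only closes at the second END.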

Can someone explain why sed lets a range match extend past its end pattern? In this case, the <record> to </record> range is also matching the </xmlsubhead1> line.

Upvotes: 2

Answers (2)

Thor

Reputation: 47229

I agree with the comments about using a proper XML parser.

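The issue with your sed script is that you read lines with N in your :delete "function". sed tests a range's end pattern only when it evaluates that address, once per cycle as execution reaches it. Because N consumes the </record> line in the middle of a cycle, the /<record>/,/<\/record>/ range never sees that line and stays open for the lines that follow. Here is a working example that uses simpler logic (run it with sed -n, as in your original script):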

/<xmlhead1>/,/<\/xmlhead1>/{
  /<xmlsubhead1>/,/<\/xmlsubhead1>/{
    /<record>/ {
      :a
      N
      /<\/record>/!ba
      /<field>DELETEKEY<\/field>/d
    }
  }
}
p

That is, when in the right context, read a full record into the pattern space (assuming a simplified XML structure); if the record contains the offending text, delete it.
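
For reference, a minimal sketch of running the script end to end against the sample from the question. The temporary file created with mktemp and the variable names inputfile and result are assumptions made for this illustration; the -n flag matters because the script prints explicitly with p.

```shell
# Recreate the sample input from the question in a temporary file (assumed location).
inputfile=$(mktemp)
cat > "$inputfile" <<'EOF'
<xmlhead1>
   <xmlsubhead1>
       <record>
           <field>Hello</field>
           <field>World</field>
       </record>
       <record>
           <field>DELETEKEY</field>
           <field>World1</field>
       </record>
   </xmlsubhead1>
</xmlhead1>
EOF

# -n suppresses automatic printing; the trailing p prints what survives.
result=$(sed -n '
/<xmlhead1>/,/<\/xmlhead1>/{
  /<xmlsubhead1>/,/<\/xmlsubhead1>/{
    /<record>/ {
      :a
      N
      /<\/record>/!ba
      /<field>DELETEKEY<\/field>/d
    }
  }
}
p
' "$inputfile")

printf '%s\n' "$result"
rm -f "$inputfile"
```

Because the whole record is gathered on the <record> line and the range addresses only ever see complete cycles, the </xmlsubhead1> line is printed and only the DELETEKEY record is dropped.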

Upvotes: 0

choroba

Reputation: 242373

Don't use sed to edit XML, use an XML-aware tool. For example, in xsh, you can write:

open file.xml ;
delete //record[field="DELETEKEY"] ;
save :b ;

Upvotes: 1
