user2549803

Reputation: 343

Extract surrounding XML Tags by child content

I have an XML file that basically looks like this:

<products xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Product Id="1">
      <Product Id="1_1">
        <Attribute Name="Whatever"></Attribute>
      </Product>
      <Attribute Name="Identifier">NumberOne</Attribute>
  </Product>
  <Product Id="2">
      <Attribute Name="Identifier">NumberTwo</Attribute>
  </Product>
</products>

What I want to do is extract a complete top-level Product node (products/Product) by searching for

<Attribute Name="Identifier">SEARCH_TEXT</Attribute>

So for example, for NumberOne I would get the surrounding Product (Id="1") tags and their content.

Example: for the search text "NumberOne" the desired result is:

<Product Id="1">
      <Product Id="1_1">
        <Attribute Name="Whatever"></Attribute>
      </Product>
      <Attribute Name="Identifier">NumberOne</Attribute>
  </Product>

for the search text "NumberTwo" it would be

<Product Id="2">
      <Attribute Name="Identifier">NumberTwo</Attribute>
  </Product>

What I tried is this regex (Python):

<Product ((?!</Product>)[\S|\s])*<Attribute Name=\"Identifier\">NumberOne</Attribute>((?!</Product>)[\S|\s])*</Product>

But this does not work because of the nested Products. Does anyone have a hint for solving this?

I read that regex is not the smartest approach for this kind of XML searching problem. In reality the top-level Products are way more complex, and I need to merge two XML files that look like my example. So I was hoping that with regex I could solve this at the string level rather than at the XML parser level, where I might need to prepare those complex objects before generating the final XML output. I just want to find the top-level Product by that Identifier value and grab it completely, no matter what else it contains.

Thanks a lot.

UPDATE: Based on Jack Fleeting's solution - this is what I ended up using (XPath):

//products//Product[Attribute[@Name="Identifier" and text()="NumberOne"]]
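
For anyone who wants to run this from Python without extra packages, here is a minimal sketch using the standard library's xml.etree.ElementTree. ElementTree only supports a subset of XPath, so the text comparison is done in Python rather than in the expression itself. The function name find_product is made up for illustration, and the sketch assumes only direct children of products should be candidates, which is narrower than the //products//Product expression above. With lxml installed, the full expression could instead be passed to its xpath() method.

```python
import xml.etree.ElementTree as ET

XML = """<products xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Product Id="1">
      <Product Id="1_1">
        <Attribute Name="Whatever"></Attribute>
      </Product>
      <Attribute Name="Identifier">NumberOne</Attribute>
  </Product>
  <Product Id="2">
      <Attribute Name="Identifier">NumberTwo</Attribute>
  </Product>
</products>"""


def find_product(root, identifier):
    """Return the first top-level Product whose own Identifier matches, else None.

    find_product is a hypothetical helper, not part of any library.
    """
    # "Product" without a leading ".//" only looks at direct children of root,
    # so the nested Product Id="1_1" is never a candidate.
    for product in root.findall("Product"):
        # Only Attribute elements directly under this Product are checked.
        for attr in product.findall("Attribute[@Name='Identifier']"):
            if (attr.text or "").strip() == identifier:
                return product
    return None


root = ET.fromstring(XML)
match = find_product(root, "NumberOne")
# Serializing the element writes out its whole subtree, nested Products included.
print(ET.tostring(match, encoding="unicode"))
```

The returned element carries its complete subtree, so nothing inside the Product has to be modelled or prepared separately before writing it back out.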

Upvotes: 0

Views: 80

Answers (1)

Jack Fleeting

Reputation: 24928

It is indeed not a good idea to try to parse XML with regex. Using XPath should get you there, assuming I understand you correctly. For example,

//Product[.//*[.="NumberOne"]]

should output:

<Product Id="1">
      <Product Id="1_1">
        <Attribute Name="Whatever"/>
      </Product>
      <Attribute Name="Identifier">NumberOne</Attribute>
  </Product>

etc.
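
Since the question also mentions merging two files of this shape, here is a rough sketch of how that could look with the standard library's xml.etree.ElementTree. The merge rule is an assumption on my part, because the question does not define it: a top-level Product from the second document is copied over only if its Identifier is not already present in the first. The helper names identifier_of and merge_products and the two sample documents are invented for illustration.

```python
import copy
import xml.etree.ElementTree as ET

# Invented sample documents in the same shape as the question's example.
BASE = """<products>
  <Product Id="1">
    <Attribute Name="Identifier">NumberOne</Attribute>
  </Product>
</products>"""

OTHER = """<products>
  <Product Id="1">
    <Attribute Name="Identifier">NumberOne</Attribute>
  </Product>
  <Product Id="2">
    <Product Id="2_1">
      <Attribute Name="Whatever"></Attribute>
    </Product>
    <Attribute Name="Identifier">NumberTwo</Attribute>
  </Product>
</products>"""


def identifier_of(product):
    """Text of the Product's own Identifier attribute, or None if it has none."""
    attr = product.find("Attribute[@Name='Identifier']")
    return attr.text if attr is not None else None


def merge_products(base_root, other_root):
    """Append top-level Products from other_root whose Identifier is new to base_root.

    Assumed merge rule (not from the question): Identifier values are the keys.
    Products without an Identifier all share the key None.
    """
    known = {identifier_of(p) for p in base_root.findall("Product")}
    for product in other_root.findall("Product"):
        key = identifier_of(product)
        if key not in known:
            # deepcopy carries the whole subtree along, nested Products included.
            base_root.append(copy.deepcopy(product))
            known.add(key)
    return base_root


merged = merge_products(ET.fromstring(BASE), ET.fromstring(OTHER))
print(ET.tostring(merged, encoding="unicode"))
```

Each Product is moved as an opaque subtree, so the code never needs to know what is inside it beyond the Identifier used as the key.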

Upvotes: 1
