Reputation: 343
I have an XML file that basically looks like this:
<products xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Product Id="1">
<Product Id="1_1">
<Attribute Name="Whatever"></Attribute>
</Product>
<Attribute Name="Identifier">NumberOne</Attribute>
</Product>
<Product Id="2">
<Attribute Name="Identifier">NumberTwo</Attribute>
</Product>
</products>
What I want to do is extract the complete products/Product node by searching for
<Attribute Name="Identifier">SEARCH_TEXT</Attribute>
So for example, for NumberOne I would get the surrounding Product (Id="1") tags and their content.
Example: for the search text "NumberOne" the desired result is:
<Product Id="1">
<Product Id="1_1">
<Attribute Name="Whatever"></Attribute>
</Product>
<Attribute Name="Identifier">NumberOne</Attribute>
</Product>
For the search text "NumberTwo" it would be:
<Product Id="2">
<Attribute Name="Identifier">NumberTwo</Attribute>
</Product>
What I tried is this regex (Python):
<Product ((?!</Product>)[\S|\s])*<Attribute Name=\"Identifier\">NumberOne</Attribute>((?!</Product>)[\S|\s])*</Product>
But this does not work because of the nested Products. Does anyone have a hint for solving this?
I have read that regex is not the smartest approach for this kind of XML search problem. In reality the top-level Products are way more complex, and I need to merge two XML files that look like my example. So I was hoping that with regex I could solve this at the "string" level rather than at the XML parser level, where I might need to prepare those complex objects before generating the final XML output. I just want to find the top-level Product by that Identifier value and grab it completely, no matter what else it contains.
Thanks a lot.
UPDATE: Based on Jack Fleeting's solution - this is what I ended up using (XPath):
//products//Product[Attribute[@Name="Identifier" and text()="NumberOne"]]
Upvotes: 0
Reputation: 24928
It is indeed not a good idea to try to parse XML with regex. Using XPath should get you there, assuming I understand you correctly. For example,
//Product[.//*[.="NumberOne"]]
should output:
<Product Id="1">
<Product Id="1_1">
<Attribute Name="Whatever"/>
</Product>
<Attribute Name="Identifier">NumberOne</Attribute>
</Product>
etc.
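Since the question mentions Python, here is a minimal sketch of the same lookup using only the standard library. `xml.etree.ElementTree` supports just a subset of XPath, so rather than relying on the expression above, this version does the filtering in plain Python. The helper name `find_product` and the embedded sample string are my own assumptions for illustration; the sample is copied from the question.

```python
import xml.etree.ElementTree as ET

# Sample data copied from the question (assumed to be representative).
XML = """<products xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Product Id="1">
<Product Id="1_1">
<Attribute Name="Whatever"></Attribute>
</Product>
<Attribute Name="Identifier">NumberOne</Attribute>
</Product>
<Product Id="2">
<Attribute Name="Identifier">NumberTwo</Attribute>
</Product>
</products>"""


def find_product(root, identifier):
    """Hypothetical helper: return the first Product element that has a
    direct child <Attribute Name="Identifier"> whose text equals
    `identifier`, or None if there is no such Product."""
    # iter() walks every Product in document order, nested ones included.
    for product in root.iter("Product"):
        # findall("Attribute") only looks at direct children of this Product.
        for attr in product.findall("Attribute"):
            if attr.get("Name") == "Identifier" and (attr.text or "").strip() == identifier:
                return product
    return None


root = ET.fromstring(XML)
match = find_product(root, "NumberOne")

# Serialize the matched element, nested Products included, back to a string.
fragment = ET.tostring(match, encoding="unicode")
print(fragment)
```

Because the match is an element rather than a substring, the nested `Product Id="1_1"` comes along automatically, and the serialized fragment can be used at the "string" level afterwards. If you need full XPath 1.0 (for example the expression above as written), a third-party library such as lxml is probably the better fit.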
Upvotes: 1