Reputation: 4470
I have an XML file with content like this:
<document> <section> <section SectionName="abstract"> <paragraph> <word Endpoint="1" SciomeSRIE_Sentence.ExposureSentence="1">gutkha</word> <word ExposureSentence="1">split_identifier ,</word> <word ExposureSentence="1">and</word> <word ExposureSentence="1">what</word> <word ExposureSentence="1">role</word> <word ExposureSentence="1">split_identifier ,</word> <word ExposureSentence="1">if</word> <word ExposureSentence="1">any</word> <word ExposureSentence="1">split_identifier ,</word> <word ExposureSentence="1">nicotine</word> <word ExposureSentence="1">contributes</word> <word ExposureSentence="1">to</word> <word ExposureSentence="1">the</word> <word ExposureSentence="1">effects</word> <word ExposureSentence="1">split_identifier .</word> <word EB_NLP_Tagger.Participant="3" AnimalGroupSentence="1" DoseGroupSentence="1" ExposureSentence="2">Adult</word> <word EB_NLP_Tagger.Participant="3" Sex="1" AnimalGroupSentence="1" DoseGroupSentence="1" ExposureSentence="2">male</word> <word EB_NLP_Tagger.Participant="3" Species="1" AnimalGroupSentence="1" DoseGroupSentence="1" ExposureSentence="2">mice</word> <word AnimalGroupSentence="1" DoseGroupSentence="1" ExposureSentence="2">were</word> <word AnimalGroupSentence="1" DoseGroupSentence="1" ExposureSentence="2">treated</word> <word AnimalGroupSentence="1" DoseGroupSentence="1" ExposureSentence="2">daily</word> <word AnimalGroupSentence="1" DoseGroupSentence="1" ExposureSentence="2">for</word>
I want to remove all occurrences of the "ExposureSentence" attribute. The output would be:
<word Endpoint="1" SciomeSRIE_Sentence.ExposureSentence="1">gutkha</word> <word >split_identifier ,</word> <word >and</word> <word >what</word> <word >role</word> <word >split_identifier ,</word> <word >if</word> <word >any</word> <word >split_identifier ,</word> <word >nicotine</word> <word >contributes</word> <word >to</word> <word >the</word> <word >effects</word> <word >split_identifier .</word> <word EB_NLP_Tagger.Participant="3" AnimalGroupSentence="1" DoseGroupSentence="1" >Adult</word> <word EB_NLP_Tagger.Participant="3" Sex="1" AnimalGroupSentence="1" DoseGroupSentence="1" >male</word> <word EB_NLP_Tagger.Participant="3" Species="1" AnimalGroupSentence="1" DoseGroupSentence="1" >mice</word> <word AnimalGroupSentence="1" DoseGroupSentence="1" >were</word> <word AnimalGroupSentence="1" DoseGroupSentence="1" >treated</word> <word AnimalGroupSentence="1" DoseGroupSentence="1" >daily</word> <word AnimalGroupSentence="1" DoseGroupSentence="1" >for</word>
I tried the following, but I'm not sure how to proceed further.
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(new ByteArrayInputStream(xml.getBytes()));
NodeList sectionNodeList = doc.getElementsByTagName("section");
for (int i = 0; i < sectionNodeList.getLength(); i++)
{
Node sectionNode = sectionNodeList.item(i);
}
Upvotes: 0
Views: 283
Reputation: 120704
XPath makes this straightforward:
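import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.stream.Collectors;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public static void main(String... args)
    throws Exception
{
    // xml is the String holding the document, as in the question
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    // Use an explicit charset rather than the platform default
    Document doc = dBuilder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    XPathFactory xPathfactory = XPathFactory.newInstance();
    XPath xpath = xPathfactory.newXPath();
    // Find word elements with ExposureSentence attribute
    XPathExpression query = xpath.compile("//word[@ExposureSentence]");
    NodeList words = (NodeList) query.evaluate(doc, XPathConstants.NODESET);
    for (int i = 0; i < words.getLength(); i++) {
        // Remove the attribute
        ((Element) words.item(i)).removeAttribute("ExposureSentence");
    }
    // Handle ComponentName: drop "ExposureSentence" from its comma-separated list
    query = xpath.compile("//ComponentName");
    NodeList componentNames = (NodeList) query.evaluate(doc, XPathConstants.NODESET);
    for (int i = 0; i < componentNames.getLength(); i++) {
        String content = componentNames.item(i).getTextContent();
        componentNames.item(i).setTextContent(
            Arrays.stream(content.split(","))
                .map(String::trim)
                .filter(s -> !s.equals("ExposureSentence"))
                .collect(Collectors.joining(", ")));
    }
    // Omitted: Save the XML
}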
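The save step is omitted above. One possible way to do it, as a minimal sketch rather than the answerer's own code, is the JDK's identity `Transformer`. The class name `XmlSaveSketch` and the helpers `toXmlString` and `stripExposureSentence` are hypothetical names made up for this example, and the removal loop uses `getElementsByTagName` instead of XPath only to keep the sketch short.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XmlSaveSketch {

    // Hypothetical helper: serialize a DOM Document back to a String.
    static String toXmlString(Document doc) throws Exception {
        // A Transformer created without a stylesheet copies its input unchanged.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Hypothetical helper: parse, remove the attribute from every <word>, serialize.
    static String stripExposureSentence(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList words = doc.getElementsByTagName("word");
            for (int i = 0; i < words.getLength(); i++) {
                // removeAttribute does nothing if the attribute is absent
                ((Element) words.item(i)).removeAttribute("ExposureSentence");
            }
            return toXmlString(doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not process XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<document><word ExposureSentence=\"1\">and</word>"
                + "<word Sex=\"1\" ExposureSentence=\"2\">male</word></document>";
        System.out.println(stripExposureSentence(xml));
    }
}
```

To write to a file instead of a String, a `StreamResult` built from a `java.io.File` can be passed to `transform`. Note that a DOM round trip may not preserve the original attribute order or whitespace exactly.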
Upvotes: 2
Reputation: 8021
I think the simplest solution is to replace all occurrences of ExposureSentence="1" with a simple regex. Read the whole XML content into a String and replace the matching text; that way you do not need to parse the XML at all.
With XML parsing, you have to parse the document, manipulate the nodes, and then rebuild the XML infoset.
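A rough sketch of what such a regex replacement could look like follows; it is an illustration under assumptions, not a robust XML tool. The class and method names are hypothetical. The negative lookbehind is an addition so that `SciomeSRIE_Sentence.ExposureSentence="1"` from the question is left alone, and the value is matched as any double-quoted string because the sample also contains `ExposureSentence="2"`.

```java
import java.util.regex.Pattern;

public class RegexAttributeRemover {

    // Matches ExposureSentence="..." only when it is NOT preceded by a name
    // character ('.', ':', '-', letter, digit, '_'), so longer attribute names
    // that merely end in "ExposureSentence" are not touched.
    private static final Pattern EXPOSURE_ATTR =
            Pattern.compile("(?<![\\w.:-])ExposureSentence\\s*=\\s*\"[^\"]*\"");

    // Hypothetical helper: remove every standalone ExposureSentence attribute.
    static String removeExposureSentence(String xml) {
        return EXPOSURE_ATTR.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        String xml = "<word Endpoint=\"1\" SciomeSRIE_Sentence.ExposureSentence=\"1\">gutkha</word> "
                + "<word ExposureSentence=\"1\">and</word> "
                + "<word Sex=\"1\" ExposureSentence=\"2\">male</word>";
        System.out.println(removeExposureSentence(xml));
    }
}
```

This leaves the space that preceded the attribute in place (`<word >and</word>`), which matches the output shown in the question. It assumes double-quoted attribute values and would also rewrite the same text if it appeared inside element content or comments, which is the usual trade-off of treating XML as plain text.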
Upvotes: -1