user1631306

Reputation: 4470

Remove all occurrences of a specific attribute from an XML document

I have an XML file with content like this:

<document>
  <section>
    <section SectionName="abstract">
     <paragraph>
    <word Endpoint="1" SciomeSRIE_Sentence.ExposureSentence="1">gutkha</word>
    <word ExposureSentence="1">split_identifier ,</word>
    <word ExposureSentence="1">and</word>
    <word ExposureSentence="1">what</word>
    <word ExposureSentence="1">role</word>
    <word ExposureSentence="1">split_identifier ,</word>
    <word ExposureSentence="1">if</word>
    <word ExposureSentence="1">any</word>
    <word ExposureSentence="1">split_identifier ,</word>
    <word ExposureSentence="1">nicotine</word>
    <word ExposureSentence="1">contributes</word>
    <word ExposureSentence="1">to</word>
    <word ExposureSentence="1">the</word>
    <word ExposureSentence="1">effects</word>
    <word ExposureSentence="1">split_identifier .</word>
    <word EB_NLP_Tagger.Participant="3" AnimalGroupSentence="1" DoseGroupSentence="1" ExposureSentence="2">Adult</word>
    <word EB_NLP_Tagger.Participant="3" Sex="1" AnimalGroupSentence="1" DoseGroupSentence="1" ExposureSentence="2">male</word>
    <word EB_NLP_Tagger.Participant="3" Species="1" AnimalGroupSentence="1" DoseGroupSentence="1" ExposureSentence="2">mice</word>
    <word AnimalGroupSentence="1" DoseGroupSentence="1" ExposureSentence="2">were</word>
    <word AnimalGroupSentence="1" DoseGroupSentence="1" ExposureSentence="2">treated</word>
    <word AnimalGroupSentence="1" DoseGroupSentence="1" ExposureSentence="2">daily</word>
    <word AnimalGroupSentence="1" DoseGroupSentence="1" ExposureSentence="2">for</word>

I want to remove all occurrences of the "ExposureSentence" attribute. The output would be:

  <word Endpoint="1" SciomeSRIE_Sentence.ExposureSentence="1">gutkha</word>
    <word >split_identifier ,</word>
    <word >and</word>
    <word >what</word>
    <word >role</word>
    <word >split_identifier ,</word>
    <word >if</word>
    <word >any</word>
    <word >split_identifier ,</word>
    <word >nicotine</word>
    <word >contributes</word>
    <word >to</word>
    <word >the</word>
    <word >effects</word>
    <word >split_identifier .</word>
    <word EB_NLP_Tagger.Participant="3" AnimalGroupSentence="1" DoseGroupSentence="1" >Adult</word>
    <word EB_NLP_Tagger.Participant="3" Sex="1" AnimalGroupSentence="1" DoseGroupSentence="1" >male</word>
    <word EB_NLP_Tagger.Participant="3" Species="1" AnimalGroupSentence="1" DoseGroupSentence="1" >mice</word>
    <word AnimalGroupSentence="1" DoseGroupSentence="1" >were</word>
    <word AnimalGroupSentence="1" DoseGroupSentence="1" >treated</word>
    <word AnimalGroupSentence="1" DoseGroupSentence="1" >daily</word>
    <word AnimalGroupSentence="1" DoseGroupSentence="1" >for</word>

I tried the following, but I am not sure how to proceed further:

        DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
        Document doc = dBuilder.parse(new ByteArrayInputStream(xml.getBytes()));
        NodeList sectionNodeList = doc.getElementsByTagName("section");
        for (int i = 0; i < sectionNodeList.getLength(); i++)
        {
            Node sectionNode = sectionNodeList.item(i);

        }

Upvotes: 0

Views: 283

Answers (2)

Sean Bright

Reputation: 120704

XPath makes this straightforward:

public static void main(String... args)
        throws Exception
{
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    Document doc = dBuilder.parse(new ByteArrayInputStream(xml.getBytes()));

    XPathFactory xPathfactory = XPathFactory.newInstance();
    XPath xpath = xPathfactory.newXPath();

    // Find word elements with ExposureSentence attribute
    XPathExpression query = xpath.compile("//word[@ExposureSentence]");
    NodeList words = (NodeList) query.evaluate(doc, XPathConstants.NODESET);
    for (int i = 0; i < words.getLength(); i++) {
        // Remove the attribute
        ((Element) words.item(i)).removeAttribute("ExposureSentence");
    }

    // Handle ComponentName
    query = xpath.compile("//ComponentName");
    NodeList componentNames = (NodeList) query.evaluate(doc, XPathConstants.NODESET);
    for (int i = 0; i < componentNames.getLength(); i++) {
        String content = componentNames.item(i).getTextContent();
        componentNames.item(i).setTextContent(
            Arrays.stream(content.split(","))
                .map(String::trim)
                .filter(s -> !s.equals("ExposureSentence"))
                .collect(Collectors.joining(", ")));
    }

    // Omitted: Save the XML
}
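
For the step marked "Omitted: Save the XML", here is a minimal sketch of one way to do it with the JDK's built-in `Transformer`. It also uses the XPath `//@ExposureSentence`, which selects the attribute nodes themselves, so the attribute is removed from any element rather than only from `word` elements. The class name `AttributeStripper` and the helper `stripAttribute` are made up for illustration, and the sketch assumes the XML is already held in a `String` and is not namespace-qualified.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AttributeStripper {

    // Hypothetical helper: removes every attribute named attrName
    // and returns the resulting document as a String.
    static String stripAttribute(String xml, String attrName) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        // "//@name" selects the attribute nodes with exactly that name, on any element
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList attrs = (NodeList) xpath.evaluate("//@" + attrName, doc, XPathConstants.NODESET);
        for (int i = 0; i < attrs.getLength(); i++) {
            Attr attr = (Attr) attrs.item(i);
            attr.getOwnerElement().removeAttributeNode(attr);
        }

        // Serialize the modified DOM back to text
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<paragraph>"
                + "<word Endpoint=\"1\" SciomeSRIE_Sentence.ExposureSentence=\"1\">gutkha</word>"
                + "<word Sex=\"1\" ExposureSentence=\"2\">male</word>"
                + "</paragraph>";
        System.out.println(stripAttribute(xml, "ExposureSentence"));
    }
}
```

Because the XPath name test matches the full attribute name, `SciomeSRIE_Sentence.ExposureSentence` is left untouched. To write to a file instead of a `String`, a `StreamResult` wrapping a `File` should work in the same place.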

Upvotes: 2

Sambit

Reputation: 8021

I think the simplest solution is to remove every occurrence of ExposureSentence="1" with a simple regex. Read the whole XML content into a String and replace the matching text; this way you do not need any XML parsing.

With XML parsing, you have to parse the document, manipulate the tree, and then rebuild the XML infoset.
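
A rough sketch of that regex idea, for illustration only. The class name `RegexAttributeRemover` and the exact pattern are my own assumptions, not something from the question. The pattern requires whitespace directly before the attribute name so that `SciomeSRIE_Sentence.ExposureSentence` is not matched, and it assumes double-quoted values. Note that a regex cannot tell markup from text, so it would also strip a matching string inside element content, comments, or CDATA.

```java
import java.util.regex.Pattern;

class RegexAttributeRemover {

    // Leading whitespace, the exact attribute name, then any double-quoted value.
    private static final Pattern EXPOSURE =
            Pattern.compile("\\s+ExposureSentence=\"[^\"]*\"");

    // Hypothetical helper: returns the XML text with the attribute stripped.
    static String remove(String xml) {
        return EXPOSURE.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(remove("<word ExposureSentence=\"1\">and</word>"));
        System.out.println(remove("<word Sex=\"1\" ExposureSentence=\"2\">male</word>"));
        System.out.println(remove(
                "<word Endpoint=\"1\" SciomeSRIE_Sentence.ExposureSentence=\"1\">gutkha</word>"));
    }
}
```

Unlike the output shown in the question, this also removes the whitespace before the attribute, so `<word ExposureSentence="1">` becomes `<word>` rather than `<word >`.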

Upvotes: -1
