amphibient
amphibient

Reputation: 31258

Simple dom traversing in Python using xml.etree.ElementTree

E.g. consider parsing a pom.xml file:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

    <parent>
        <groupId>com.parent</groupId>
        <artifactId>parent</artifactId>
        <version>1.0-SNAPSHOT</version>
        <relativePath>../pom.xml</relativePath>
    </parent>

    <modelVersion>2.0.0</modelVersion>
    <groupId>com.parent.somemodule</groupId>
    <artifactId>some_module</artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>Some Module</name>
    ...

Code:

import xml.etree.ElementTree as ET

tree = ET.parse(pom)
root = tree.getroot()

groupId = root.find("groupId")
artifactId = root.find("artifactId")

Both groupId and artifactId are None. Why when they are the direct descendants of the root? I tried to replace the root with tree (groupId = tree.find("groupId")) but that didn't change anything.

Upvotes: 1

Views: 1665

Answers (2)

abarnert
abarnert

Reputation: 365915

The problem is that you don't have a child named groupId, you have a child named {http://maven.apache.org/POM/4.0.0}groupId, because etree doesn't ignore XML namespaces, it uses "universal names". See Working with Namespaces and Qualified Names in the effbot docs.

Upvotes: 4

Sean K.
Sean K.

Reputation: 45

Just to expand on abarnert's comment about BeautifulSoup, if you DO just want a quick and dirty solution to the problem, this is probably the fastest way to go about it. I have implemented this (for a personal script) that uses bs4, where you can traverse the tree with

element = dom.getElementsByTagNameNS('*','elementname')

This will reference the dom using ANY namespace, handy if you know you've only got one in the file so there's no ambiguity.

Upvotes: 1

Related Questions