rgk

Reputation: 866

XML parsing using Python minidom when the XML has special characters

I have an XML file in which one key needs a TAB character as its value. Following the linked question "Represent space and tab in XML tag", I encoded it as &#009; rather than "\t", because "\t" was being interpreted as a string containing the two characters '\' and 't'.

I did not use a CDATA section, because that would still treat "\t" as a string containing the two characters '\' and 't'.
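As a small sketch of the difference described above (written for Python 3; the `<v>` element and the `text_of` helper are made up for illustration), the three encodings can be compared directly with minidom:

```python
from xml.dom import minidom


def text_of(xml):
    """Return the data of the first child node of the root element."""
    return minidom.parseString(xml).documentElement.firstChild.data


# A backslash followed by 't' in XML text is just two ordinary characters.
literal = text_of(r"<v>\t</v>")

# A numeric character reference is resolved by the parser to a real TAB.
reference = text_of("<v>&#009;</v>")

# CDATA content is taken verbatim, so the backslash and 't' stay as they are.
cdata = text_of(r"<v><![CDATA[\t]]></v>")

print(repr(literal), repr(reference), repr(cdata))
```

Only the character reference yields a single TAB character; the other two forms come back as a backslash followed by 't'.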

A sample XML file for my use case looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<keys>
    <key>
        <name>key1</name>
        <value>value1</value>
    </key>
    <key>
        <name>key2</name>
        <value>&#009;</value>
    </key>
    <key>
        <name>key3</name>
        <value>2048</value>
    </key>
</keys>

This is my current code, which does not handle the TAB character:

...
dom_obj = minidom.parse(self.path_to_xml)
...
for each_key_child in key_child:
    if each_key_child.nodeType == Node.ELEMENT_NODE:
        if each_key_child.nodeName == 'name':
            node_name = str(each_key_child.childNodes[0].data.strip())
        elif each_key_child.nodeName == 'value':
            node_value = str(each_key_child.childNodes[0].data.strip())
        else:
            pass
    else:
        pass

The output that I get when the script runs is:

'key1': 'value1',
'key2': '',
'key3': '2048',

But when I run the following in the Python interactive interpreter:

mobj = minidom.parse(path_to_xml_file)
mobj.getElementsByTagName("value")[1].childNodes[0]

I get the following output:

<DOM Text node "u'\t'">

But I am not able to assign that value to a variable. This step does not seem to work:

node = mobj.getElementsByTagName("value")[1].childNodes[0].data

Strangely, though, when I just type node at the interpreter, it shows '\t':

node
u'\t'

To check whether the TAB character was actually being stored in the variable but simply not displayed, I used it as a separator when concatenating two strings.

This works fine at the interpreter but not in the script, whose output I inspected in vim with the :set list option.

Can anyone tell me what is wrong with my approach? Help appreciated!

Upvotes: 1

Views: 940

Answers (1)

abarnert

Reputation: 365707

You're calling strip(). With no argument, it strips all leading and trailing whitespace, and that includes tabs. Just don't do that. (Or, if you need to strip spaces or newlines or something specific, but leave tabs, call it with a specific argument, like strip('\n').)

Here's a demonstration (faked, because your example XML isn't valid, so I can't test it):

>>> mobj.getElementsByTagName("value")[1].childNodes[0]
<DOM Text node "u'\t'">
>>> mobj.getElementsByTagName("value")[1].childNodes[0].data
u'\t'
>>> mobj.getElementsByTagName("value")[1].childNodes[0].data.strip()
u''
>>> mobj.getElementsByTagName("value")[1].childNodes[0].data.strip('\n')
u'\t'
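For something that can actually be run, here is a minimal sketch in Python 3 (so the strings show without the u prefix). The `read_keys` function, its `strip_chars` parameter, and the inline `XML` constant are assumptions made for illustration, not the asker's real code, which is only partly shown. It strips spaces and line breaks from each value but leaves tabs alone:

```python
from xml.dom import minidom
from xml.dom.minidom import Node

# Inline copy of the sample document from the question.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<keys>
    <key>
        <name>key1</name>
        <value>value1</value>
    </key>
    <key>
        <name>key2</name>
        <value>&#009;</value>
    </key>
    <key>
        <name>key3</name>
        <value>2048</value>
    </key>
</keys>
"""


def read_keys(xml_bytes, strip_chars=" \r\n"):
    """Map each <name> to its <value>, stripping only strip_chars from values."""
    dom = minidom.parseString(xml_bytes)
    result = {}
    for key in dom.getElementsByTagName("key"):
        name = value = None
        for child in key.childNodes:
            if child.nodeType != Node.ELEMENT_NODE:
                continue  # skip the whitespace-only text nodes between elements
            text = "".join(
                n.data for n in child.childNodes if n.nodeType == Node.TEXT_NODE
            )
            if child.nodeName == "name":
                name = text.strip()
            elif child.nodeName == "value":
                # Not a bare strip(): that would remove the tab as well.
                value = text.strip(strip_chars)
        result[name] = value
    return result


print(read_keys(XML))
```

Passing `strip_chars=None` would fall back to the default whitespace stripping and reproduce the empty value for key2.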

Upvotes: 3
