ale19

Reputation: 1367

Read XML characters as strings into ElementTree

I'm using ElementTree to compare a CSV file to an XML document. The script should update the tags if the tag matches the first cell in the CSV. The tag needs to have a non-breaking space to prevent the text from wrapping when I import the XML into a different program (InDesign).

XML Input:

<Table_title>fatal crashes by&#160;time of day</Table_title>
<cell>data1</cell>
<cell>data2</cell>
<cell>data3</cell>

CSV input:

'fatal crashes by&#160;time of day', data1, data2, data3

However, when I read the XML into the ElementTree script using ET.parse('file.xml'), it seems to render the character as a non-breaking space:

<Table_title>fatal crashes by time of day</Table_title>
<cell>data1</cell>
<cell>data2</cell>
<cell>data3</cell>

Which is exactly what it should do (I think). But in this scenario, I actually want &#160; to render as a string, so that it matches the first cell of the CSV (because when the CSV is read in, it interprets it as a string: 'fatal crashes by&#160;time of day').

Is there a way to:

  1. Force the XML script to read the non-breaking space as a string instead of an escaped character: <Table_title>fatal crashes by&#160;time of day</Table_title>

or

  2. Force the XML script to read the CSV and render the character as an escaped character instead of a string: 'fatal crashes by time of day', data1, data2, data3

Upvotes: 1

Views: 249

Answers (1)

Tomalak

Reputation: 338406

Here is what happens.

You read this XML into ElementTree:

<Table_title>fatal crashes by&#160;time of day</Table_title>

ElementTree parses it and turns it into this tree:

  • element node, name Table_title
    • text node, string value: "fatal crashes by・time of day" (where "・" stands for the character with code 160, i.e. the non-breaking space)

This is 100% correct and you can't (and should not want to) do anything about it.
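
A minimal sketch of that behaviour, assuming only the standard library: once parsed, the element's text holds the actual character U+00A0, and serializing with ElementTree's default ASCII encoding writes it back out as a character reference. The variable names `elem` and `serialized` are just for illustration.

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring('<Table_title>fatal crashes by&#160;time of day</Table_title>')

# The parser has already decoded the character reference into U+00A0.
print(repr(elem.text))

# With the default (ASCII) encoding, non-ASCII characters are written
# back out as numeric character references.
serialized = ET.tostring(elem)
print(serialized)
```

So the non-breaking space is not lost by parsing; it just lives in memory as a character rather than as the six-character sequence `&#160;`.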

Your CSV also appears to contain a snippet of XML in its first column. However, it remains un-parsed until you parse it.

If you want to be able to compare the text values, you have no choice but to XML-parse the first column.

import csv
import xml.etree.ElementTree as ET

# open your XML and CSV files...

for row in csv_reader:
    temp = ET.fromstring('<temp>' + row[0] + '</temp>')
    print(temp.text)

    # compare temp.text to your XML 
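
For a version that runs end to end, here is a rough sketch under a few assumptions: the sample data from the question is inlined as strings instead of being read from files, the XML fragment is wrapped in a hypothetical `<root>` element (the question does not show the real root), and the CSV is assumed to use single quotes with a space after each comma, as in the sample row.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Stand-ins for the real files; the <root> wrapper is an assumption.
xml_source = (
    '<root>'
    '<Table_title>fatal crashes by&#160;time of day</Table_title>'
    '<cell>data1</cell><cell>data2</cell><cell>data3</cell>'
    '</root>'
)
csv_source = "'fatal crashes by&#160;time of day', data1, data2, data3\n"

root = ET.fromstring(xml_source)
title = root.find('Table_title')

# quotechar and skipinitialspace match the quoting style of the sample row.
csv_reader = csv.reader(io.StringIO(csv_source), quotechar="'", skipinitialspace=True)

matches = []
for row in csv_reader:
    # Parse the first cell as XML so &#160; becomes the real character.
    temp = ET.fromstring('<temp>' + row[0] + '</temp>')
    if temp.text == title.text:
        matches.append(row)

print(matches)
```

Note that a first cell containing a bare `&` or `<` is not well-formed XML, so `ET.fromstring` would raise `ET.ParseError` for it; wrapping the call in a `try`/`except` may be worth considering if the CSV is not fully under your control.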

Upvotes: 2
