gary69

Reputation: 4240

Python remove &#xD from XML

I'm extracting text from an XML file and printing it to a text file using Python. Some lines in the XML file contain '&#xD;' and '&#xA;', which cause the text to be written to the text file with carriage returns and line feeds. There are existing answers, such as https://stackoverflow.com/questions/28794365/remove-xd-from-xml, on how to remove these characters in Ruby and PHP so that there are no line breaks. How do I do this in Python? Here is my code:

from xml.dom.minidom import parse  # assumed import: the calls below match the minidom API

with open("xmlfile") as f:
    doc = parse(f)
    str = doc.getElementsByTagName("informations")[0].getAttribute("text")
    print(str)
    str = str.replace("&#xD;", " ").replace("&#xA;", " ")
    print(str)

Here is the string in the XML file:

"An Airport Contact Method, Is Alter must be one of the following:&#xD;&#xA;- "T" or "F" (boolean true or false) or empty" language="en"

Output:

An Airport Contact Method, Is Alter must be one of the following:
- "T" or "F" (boolean true or false) or empty
An Airport Contact Method, Is Alter must be one of the following:
- "T" or "F" (boolean true or false) or empty

Upvotes: 2

Views: 4157

Answers (1)

Dan Field

Reputation: 21661

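By the time your XML library has parsed the document, it has already resolved the character references. The string you get back contains actual carriage return and line feed characters, not the literal text &#xD; and &#xA;, so searching for that literal text finds nothing.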
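To see this for yourself, here is a minimal sketch assuming the standard library's xml.dom.minidom. The one-element document is a made-up, shortened version of the question's data, not the real file.

```python
from xml.dom.minidom import parseString

# Hypothetical one-element document modelled on the question's data.
xml_text = '<informations text="must be one of:&#xD;&#xA;- empty" language="en"/>'

doc = parseString(xml_text)
value = doc.getElementsByTagName("informations")[0].getAttribute("text")

# The literal reference text is gone after parsing ...
print("&#xD;" in value)   # False
# ... and the real control characters are there instead.
print("\r" in value)      # True
print("\n" in value)      # True
```

This is why a replace() that looks for the reference text never matches anything in the parsed string.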

Replace

str = str.replace("&#xD;", " ").replace("&#xA;", " ")

with

str = str.replace("\r", " ").replace("\n", " ")

Per @martineau's suggestion, if you're ever not sure which character an XML entity resolves to, you can try print(repr(str)) to get a better picture of what the string actually contains once it has been parsed.
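Putting the two ideas together, something like the following might work. It is only a sketch: flatten_line_breaks is a hypothetical helper name, and the sample value is made up. Unlike the one-liner above, it handles "\r\n" first, so a CR LF pair becomes one space instead of two. It also avoids naming a variable str, which shadows the built-in type.

```python
def flatten_line_breaks(text):
    """Replace each CR LF pair, lone CR, or lone LF with a single space."""
    # Handle "\r\n" first so a Windows-style line ending becomes one space, not two.
    return text.replace("\r\n", " ").replace("\r", " ").replace("\n", " ")


# Hypothetical value, as it might look after parsing.
parsed = 'must be one of the following:\r\n- "T" or "F"'

print(repr(parsed))                       # the \r\n escapes are visible here
print(repr(flatten_line_breaks(parsed)))  # no line-break characters remain
```

If you want to collapse every run of whitespace rather than just line breaks, " ".join(text.split()) is another option, but it also squeezes tabs and repeated spaces.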

Upvotes: 3
