Adam Nelson

Reputation: 8090

Dealing with Windows line-endings in Python

I've got a 700MB XML file coming from a Windows provider.

As one might expect, the line endings are '\r\n' (shown as ^M in vi). What is the most efficient way to deal with this, aside from getting the supplier to send over '\n'? :-)

  1. Use os.linesep
  2. Use rstrip() (requiring opening the file ... which seems crazy)
  3. Use universal newline support (not standard on my Mac with Snow Leopard, so it isn't an option)

I'm open to anything that requires Python 2.6+ but it needs to work on Snow Leopard and Ubuntu 9.10 with minimal external requirements. I don't mind a small performance penalty but I am looking for the standard best way to deal with this.

----edit----

The line endings are in the middle of the tag descriptors, otherwise they wouldn't be such a problem. I know this is bad form and that they shouldn't be sending this to me, but this is how I have the file and the vendor is mostly incompetent.

Upvotes: 0

Views: 12362

Answers (4)

dash-tom-bang

Reputation: 17883

Are you opening the file in text mode or binary mode? I'm pretty sure I've counted on universal newlines on my Leopard install, but maybe I got an updated Python from somewhere too...

Anyway, I've seen this sort of thing bite many programmers in the bum, because they just reach for the 'b' key. Use 't' if you're opening text files known to have been created on your platform, and 'U' instead of 't' if you need universal newlines.

# 'rt' opens in text mode; swap in 'rU' for universal newlines
with open(filename, 'rt') as f:
    content = f.read()

Edit: The comments note that 'rt' is the default. Fair point, but Python style tends to prefer explicit over implicit, so I'm going with that.

Upvotes: 2

John Machin

Reputation: 83032

Allegedly: """This guy has \r\n right in the middle of tag descriptors like so: <ParentRedirec tSequenceID>""".

I see no \r\n here. Perhaps you mean repr(xml) contains things like

"<ParentRedirec\r\ntSequenceID>"

If not, try to say precisely what you mean, with repr-fashion examples.

The following should work:

>>> import re
>>> guff = """<atag>\r\n<bt\r\nag c="2">"""
>>> re.sub(r"(<[^>]*)\r\n([^>]*>)", r"\1\2", guff)
'<atag>\r\n<btag c="2">'
>>>

If there is more than one line break in a tag, e.g. <foo\r\nbar\r\nzot>, this will fix only one of them per pass (the last one, because the first group is greedy). Alternatives: (1) loop until the guff stops shrinking; (2) write a smarter regexp yourself :-)
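One possible take on the "smarter" alternative, as a minimal sketch rather than a definitive fix: match each whole tag and strip every '\r\n' inside it with a replacement function. The names `TAG` and `strip_breaks_in_tags` are made up for illustration, and the sketch assumes no literal '>' appears inside attribute values, comments, or CDATA sections.

```python
import re

# Hypothetical helper names; the pattern matches one whole tag at a time.
TAG = re.compile(r"<[^>]*>")


def strip_breaks_in_tags(text):
    # Remove every '\r\n' found inside a tag; text between tags is left alone.
    return TAG.sub(lambda m: m.group(0).replace("\r\n", ""), text)


guff = '<atag>\r\n<foo\r\nbar\r\nzot c="2">'
fixed = strip_breaks_in_tags(guff)
print(repr(fixed))
```

This handles any number of breaks per tag in a single pass. It still works on a string held in memory, so a 700MB file would need to be read whole or processed in chunks that do not split a tag.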

Upvotes: 1

Thomas Wouters

Reputation: 133503

Why are the DOS line-endings a problem? Most things can deal with them just fine, including XML parsers. If you really want to get rid of them, open the file in universal line-endings mode:

open(filename, 'rU')

Python will convert all line-endings to UNIX line-endings for you. If you really can't use that (which I find a little surprising), there's no way to get Python to do the work for you. You will have to open the file regardless, though, so your objection to #2 seems a little odd.
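If the 'rU' mode really is unavailable, one alternative worth trying is `io.open`, which exists from Python 2.6 onward and translates line endings when `newline=None`. The sketch below is only an illustration: the helper name `read_with_universal_newlines`, the sample content, and the UTF-8 encoding are all assumptions, not details from the question.

```python
import io
import os
import tempfile


def read_with_universal_newlines(path, encoding="utf-8"):
    # newline=None enables universal newline translation in io.open,
    # so CR LF and lone CR both come back as LF.
    with io.open(path, "r", encoding=encoding, newline=None) as f:
        return f.read()


# Write a small sample with Windows line endings (binary mode, so nothing is translated).
fd, sample_path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as f:
    f.write(b"<root>\r\n<item>1</item>\r\n</root>\r\n")

text = read_with_universal_newlines(sample_path)
os.remove(sample_path)
print(repr(text))
```

For a 700MB file, iterating over the file object line by line instead of calling `read()` would avoid holding everything in memory at once.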

Upvotes: 6

Alexander Lebedev

Reputation: 6044

What are you trying to do with this file? Whitespace between tags is usually ignored in XML, so the only place where line endings matter is within the tags' content.
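A quick way to check whether the line endings matter at all is to feed a small '\r\n' sample to a standard parser. The sketch below uses `xml.etree.ElementTree` from the standard library; the sample document is invented for illustration. It covers a break between a tag name and an attribute, which is legal whitespace. A break that splits the tag name itself is a different case and would not be repaired by the parser.

```python
import xml.etree.ElementTree as ET

# Invented sample: CR LF between elements and inside a start tag
# (between the tag name and its attribute).
sample = b"<root>\r\n<item\r\n id='1'>x</item>\r\n</root>"

root = ET.fromstring(sample)
item = root.find("item")

print(item.get("id"))   # attribute is read despite the break inside the tag
print(item.text)        # element content is unaffected
# The XML specification requires parsers to normalize line endings to LF,
# so the whitespace between tags should no longer contain a CR.
print(repr(root.text))
```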

Upvotes: 0
