Reputation: 8090
I've got a 700MB XML file coming from a Windows provider.
As one might expect, the line endings are '\r\n' (or ^M in vi). What is the most efficient way to deal with this, aside from getting the supplier to send over '\n'? :-)
I'm open to anything that requires Python 2.6+ but it needs to work on Snow Leopard and Ubuntu 9.10 with minimal external requirements. I don't mind a small performance penalty but I am looking for the standard best way to deal with this.
----edit----
The line endings are in the middle of the tag descriptors, otherwise they wouldn't be such a problem. I know this is bad form and that they shouldn't be sending this to me, but this is how I have the file and the vendor is mostly incompetent.
Upvotes: 0
Views: 12362
Reputation: 17883
Are you opening the file in text mode or binary mode? I'm pretty sure I've counted on universal newlines on my Leopard install, but maybe I got an updated Python from somewhere too...
Anyway, I've seen this sort of thing bite many programmers in the bum, because they just reach for the 'b' key. Use 't' if you're opening text files known to be created on your platform, or 'U' instead of 't' if you need universal newlines.
with file(filename, 'rt') as f:
    content = f.read()
Edit: The comments note that 'rt' is the default. Fair point, but Python style tends to prefer explicit over implicit, so I'm going with that.
Upvotes: 2
Reputation: 83032
Allegedly: """This guy has \r\n right in the middle of tag descriptors like so: <ParentRedirec tSequenceID>
""".
I see no \r\n here. Perhaps you mean repr(xml) contains things like
"<ParentRedirec\r\ntSequenceID>"
If not, try to say precisely what you mean, with repr-fashion examples.
The following should work:
>>> import re
>>> guff = """<atag>\r\n<bt\r\nag c="2">"""
>>> re.sub(r"(<[^>]*)\r\n([^>]*>)", r"\1\2", guff)
'<atag>\r\n<btag c="2">'
>>>
If there is more than one line break in a tag, e.g. <foo\r\nbar\r\nzot>, this will fix only one of them per pass (the last one, because the first group is greedy). Alternatives: (1) loop until the guff stops shrinking; (2) write a smarter regexp yourself :-)
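One possible shape for the "smarter" option is to match each whole tag and strip the breaks inside it with a callback, so any number of breaks per tag is handled in one pass. This is only a sketch: the name strip_breaks_in_tags is made up for illustration, it assumes no '>' appears inside attribute values, comments or CDATA sections, and it assumes the text fits in memory (for a 700MB file you would probably want to feed it in pieces).

```python
import re

# Matches one tag, from '<' to the next '>' (assumes no '>' inside the tag itself).
TAG_RE = re.compile(r"<[^>]*>")

def strip_breaks_in_tags(text):
    """Remove every \\r\\n that falls inside a tag; leave breaks between tags alone."""
    return TAG_RE.sub(lambda m: m.group(0).replace("\r\n", ""), text)

guff = '<atag>\r\n<foo\r\nbar\r\nzot c="2">'
fixed = strip_breaks_in_tags(guff)
```

Here fixed keeps the \r\n between the two tags but joins the second tag back into a single <foobarzot c="2">.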
Upvotes: 1
Reputation: 133503
Why are the DOS line-endings a problem? Most things can deal with them just fine, including XML parsers. If you really want to get rid of them, open the file in universal line-endings mode:
open(filename, 'rU')
Python will convert all line-endings to UNIX line-endings for you. If you really can't use that (which I find a little surprising), there's no way to get Python to do the work for you. You will have to open the file regardless, though, so your objection to #2 seems a little odd.
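If a converted copy on disk is wanted, the same universal-newline translation can be streamed in chunks so the 700MB never sits in memory at once. A rough sketch, assuming the file is UTF-8 (the encoding is a guess, as are the function name and chunk size); it uses io.open, which exists from Python 2.6 on and does the same translation as 'rU' when newline=None:

```python
import io

def normalize_newlines(src_path, dst_path, encoding="utf-8", chunk_size=1 << 20):
    """Copy src_path to dst_path, turning \\r\\n (and lone \\r) into \\n."""
    # newline=None on the reader enables universal-newline translation.
    with io.open(src_path, "r", encoding=encoding, newline=None) as src:
        # newline="\n" on the writer means '\n' is written out untranslated.
        with io.open(dst_path, "w", encoding=encoding, newline="\n") as dst:
            while True:
                chunk = src.read(chunk_size)
                if not chunk:
                    break
                dst.write(chunk)
```

Note that this only normalizes line endings; a break in the middle of a tag name becomes '\n' rather than disappearing.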
Upvotes: 6
Reputation: 6044
What are you trying to do with this file? Whitespace between tags is usually ignored in XML, so the only place where line endings matter is in the tags' content.
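A quick way to check this is to feed a small '\r\n' sample to the standard-library parser. This sketch uses xml.etree.ElementTree with a made-up document; it shows a break between the tag name and an attribute being accepted, and the parser handing back '\n' for the text between tags. It does not help with a break inside a tag name itself, which is still malformed.

```python
import xml.etree.ElementTree as ET

# Made-up sample: \r\n between tags, and \r\n between a tag name and its attribute.
sample = '<a>\r\n<b\r\n c="2">x</b>\r\n</a>'

root = ET.fromstring(sample)
child = root[0]
attr_value = child.get("c")   # the attribute survives the break inside the tag
between_tags = root.text      # whitespace the parser saw before <b>
```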
Upvotes: 0