Reputation: 46
I have a CSV file that looks like this:
Hello\r\n
my name is Alex\n
Hello\r\n
my name is John?\n
I'm trying to open the file with the newline character defined as '\n':
with open(outputfile, encoding="ISO-8859-15", newline='\n') as csvfile:
I get:
line1 = 'Hello'
line2 = 'my name is Alex'
line3 = 'Hello'
line4 = 'my name is John'
My desired result is:
line1 = 'Hello\r\nmy name is Alex'
line2 = 'Hello\r\nmy name is John'
Do you have any suggestions on how to fix this? Thank you in advance!
Upvotes: 0
Views: 1262
Reputation: 408
From the documentation of the built-in function open in the standard library:
When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.
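As a rough illustration of the quoted passage, the sketch below wraps the sample data from the question in an in-memory stream and lists the lines Python produces for several newline values. The helper name split_lines is made up for this example; io.TextIOWrapper takes the same newline argument as open.

```python
import io

# Sample data from the question, as raw bytes.
SAMPLE = b"Hello\r\nmy name is Alex\nHello\r\nmy name is John\n"


def split_lines(data, newline):
    """Return the lines Python yields for `data` with the given newline setting."""
    stream = io.TextIOWrapper(
        io.BytesIO(data), encoding="ISO-8859-15", newline=newline
    )
    with stream:
        return list(stream)


for setting in (None, "", "\n", "\r\n"):
    print(repr(setting), split_lines(SAMPLE, setting))
```

None of these settings returns 'Hello\r\nmy name is Alex' as a single line, which is why the text has to be split or re-joined manually.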
The file object itself cannot distinguish a '\r\n' that is part of the data (as in your case) from the '\n' separator; that would be the responsibility of the decoder. One option, then, is to write your own codec and pass its name as the encoding of your text file, but that is tedious. For small files it is much easier to read the whole file and split it with the re module. For large files, iterate with the generator proposed by @Martijn Pieters instead.
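import re
# newline='' leaves '\r\n' and '\n' untranslated in the data that is read
with open('data.csv', 'r', encoding="ISO-8859-15", newline='') as f:
    file_data = f.read()
# Approach 1: split on every '\n' that is not preceded by '\r'
lines1 = re.split(r'(?<!\r)\n', file_data)
if not lines1[-1]:
    lines1.pop()  # drop the empty string left after the final '\n'
# Approach 2: match runs of text that may contain '\r\n' but no bare '\n'
lines2 = re.findall(r'(?:.+?(?:\r\n)?)+', file_data)
# Approach 3: same pattern as approach 2, evaluated lazily
iterator_lines3 = map(re.Match.group, re.finditer(r'(?:.+?(?:\r\n)?)+', file_data))
assert lines1 == lines2 == list(iterator_lines3)
print(lines1)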
If we need '\n' at the end of each line:
# Approach 1:
nlines1 = re.split(r'(?<!\r\n)(?<=\n)', file_data)
if not nlines1[-1]:
    nlines1.pop()
# Approach 2:
nlines2 = re.findall(r'(?:.+?(?:\r\n)?)+\n?', file_data)
# Approach 3:
iterator_nlines3 = map(re.Match.group, re.finditer(r'(?:.+?(?:\r\n)?)+\n', file_data))
assert nlines1 == nlines2 == list(iterator_nlines3)
print(nlines1)
Results:
['Hello\r\nmy name is Alex', 'Hello\r\nmy name is John']
['Hello\r\nmy name is Alex\n', 'Hello\r\nmy name is John\n']
Upvotes: 1
Reputation: 46
I'm sure your answers are completely correct and technically advanced. Sadly, the CSV file is not at all RFC 4180 compliant.
Therefore I'm going with the following solution, which replaces '\r\n' with the temporary characters "||"; I restore them afterwards:
with open(outputfile_corrected, 'w') as correctedfile_handle:
    with open(outputfile, encoding="ISO-8859-15", newline='') as csvfile:
        csvfile_content = csvfile.read()
        csvfile_content_new = csvfile_content.replace('\r\n', '||')
        correctedfile_handle.write(csvfile_content_new)
(Someone suggested this in an answer, but that answer has since been deleted.)
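The placeholder idea can also be sketched entirely in memory, without a second file. This is only an illustration: the function name split_keeping_crlf is hypothetical, and it assumes the placeholder "||" never occurs in the real data, which is why the sketch refuses to run if it does.

```python
def split_keeping_crlf(text, placeholder="||"):
    """Split `text` on bare '\n' while keeping every '\r\n' inside the lines."""
    if placeholder in text:
        # Assumption: the placeholder must not collide with real data.
        raise ValueError("placeholder already occurs in the text")
    masked = text.replace("\r\n", placeholder)  # hide '\r\n' from the split
    lines = masked.split("\n")
    if lines and not lines[-1]:
        lines.pop()  # drop the empty string after the final '\n'
    # Put the original '\r\n' sequences back.
    return [line.replace(placeholder, "\r\n") for line in lines]


sample = "Hello\r\nmy name is Alex\nHello\r\nmy name is John\n"
print(split_keeping_crlf(sample))
```

The text must have been read with newline='' (as in the snippet above), otherwise the '\r\n' sequences are already translated to '\n' before the replacement runs.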
Upvotes: 1
Reputation: 1121924
Python's line splitting algorithm can't do what you want; lines that end in \r\n also end in \n. At most you can set the newline argument to either '\n' or '' and re-join lines if they end in \r\n instead of \n. You can use a generator function to do that for you:
def collapse_CRLF(fileobject):
    buffer = []  # lines ending in '\r\n' that still await their continuation
    for line in fileobject:
        if line.endswith('\r\n'):
            buffer.append(line)
        else:
            yield ''.join(buffer) + line
            buffer = []
    if buffer:
        # the file ended on a '\r\n' line
        yield ''.join(buffer)
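Then wrap the open file object in it (the generator is not a context manager, so the with statement goes on open itself):
with open(outputfile, encoding="ISO-8859-15", newline='') as csvfile:
    for line in collapse_CRLF(csvfile):
        ...  # process each re-joined line here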
However, if this is a CSV file, then you really want to use the csv module. It already handles files with a mix of \r\n and \n endings, as it knows how to preserve bare newlines inside quoted fields in RFC 4180 CSV files:
import csv
with open(outputfile, encoding="ISO-8859-15", newline='') as inputfile:
    reader = csv.reader(inputfile)
Note that in a valid CSV file, \r\n is the delimiter between rows, and \n is valid in column values. So if you did not want to use the csv module here for whatever reason, you'd still want to use newline='\r\n'.
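To make the distinction concrete, here is a small sketch with made-up in-memory data (the strings and the parse helper are illustrative only, not taken from the question's file). It suggests that csv.reader keeps a bare \n only when it sits inside a quoted field; unquoted data like the sample in the question is still split at every line ending.

```python
import csv
import io

# Hypothetical RFC 4180 style data: the '\n' sits inside a quoted field.
quoted = 'id,comment\r\n1,"Hello\nmy name is Alex"\r\n'
# Data shaped like the question's file: no quoting at all.
unquoted = 'Hello\r\nmy name is Alex\n'


def parse(text):
    """Parse `text` with csv.reader, leaving line endings untranslated."""
    return list(csv.reader(io.StringIO(text, newline='')))


print(parse(quoted))    # the embedded '\n' stays inside one field
print(parse(unquoted))  # every line ending starts a new row
```

This matches the follow-up in the thread: if the file does not quote its fields, the csv module alone will not keep 'Hello\r\nmy name is Alex' together.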
Upvotes: 0