Reputation: 46

Python: Reading a file by using \n as the newline character. File also contains \r\n

I'm looking at a .CSV-file that looks like this:

Hello\r\n
my name is Alex\n
Hello\r\n
my name is John?\n

I'm trying to open the file with the newline character defined as '\n':

with open(outputfile, encoding="ISO-8859-15", newline='\n') as csvfile:

I get:

line1 = 'Hello'
line2 = 'my name is Alex'
line3 = 'Hello'
line4 = 'my name is John'

My desired result is:

line1 = 'Hello\r\nmy name is Alex'
line2 = 'Hello\r\nmy name is John'

Do you have any suggestions on how to fix this? Thank you in advance!

Upvotes: 0

Answers (3)

facehugger

Reputation: 408

From the documentation of the built-in function open in the standard library:

When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.
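As a rough illustration of the quoted paragraph, the sketch below reads the same bytes with several newline values. The helper read_lines and the sample bytes are made up for this demonstration and are not part of the question.

```python
import os
import tempfile

def read_lines(raw, newline):
    """Write raw bytes to a temporary file and read them back as text lines."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(raw)
        with open(path, encoding='ISO-8859-15', newline=newline) as f:
            return f.readlines()
    finally:
        os.remove(path)

# Sample data mixing all three line-ending styles (hypothetical).
raw = b'a\r\nb\nc\rd'

for newline in (None, '', '\n', '\r\n'):
    print(repr(newline), read_lines(raw, newline))
# None   -> ['a\n', 'b\n', 'c\n', 'd']        translated to '\n'
# ''     -> ['a\r\n', 'b\n', 'c\r', 'd']      split everywhere, untranslated
# '\n'   -> ['a\r\n', 'b\n', 'c\rd']          split on '\n' only
# '\r\n' -> ['a\r\n', 'b\nc\rd']              split on '\r\n' only
```

Note that no value of newline splits on a bare '\n' while keeping '\r\n' inside a line, which is why some post-processing is needed.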

The file object itself cannot distinguish a '\r\n' that is part of the data (as in your case) from the '\n' separator; that would be the job of the bytes decoder. So one option is probably to write your own decoder and use the associated encoding as the encoding of your text file. That is a bit tedious, though, and for small files it is much easier to take a more straightforward approach with the re module. To iterate over large files, use the solution proposed by @Martijn Pieters instead.

import re

with open('data.csv', 'tr', encoding="ISO-8859-15", newline='') as f:
    file_data = f.read()

# Approach 1:
lines1 = re.split(r'(?<!\r)\n', file_data)
if not lines1[-1]:
    lines1.pop()
# Approach 2:
lines2 = re.findall(r'(?:.+?(?:\r\n)?)+', file_data)
# Approach 3:
iterator_lines3 = map(re.Match.group, re.finditer(r'(?:.+?(?:\r\n)?)+', file_data))

assert lines1 == lines2 == list(iterator_lines3)
print(lines1)

If we need '\n' at the end of each line:

# Approach 1:
nlines1 = re.split(r'(?<!\r\n)(?<=\n)', file_data)
if not nlines1[-1]:
    nlines1.pop()
# Approach 2:
nlines2 = re.findall(r'(?:.+?(?:\r\n)?)+\n?', file_data)
# Approach 3:
iterator_nlines3 = map(re.Match.group, re.finditer(r'(?:.+?(?:\r\n)?)+\n', file_data))

assert nlines1 == nlines2 == list(iterator_nlines3)
print(nlines1)

Results:

['Hello\r\nmy name is Alex', 'Hello\r\nmy name is John']
['Hello\r\nmy name is Alex\n', 'Hello\r\nmy name is John\n']

Upvotes: 1

Alex

Reputation: 46

I'm sure your answers are completely correct and technically advanced. Sadly, the CSV file is not at all RFC 4180 compliant.

Therefore I'm going with the following solution and will fix up my temporary "||" placeholder afterwards:

with open(outputfile_corrected, 'w') as correctedfile_handle:
    with open(outputfile, encoding="ISO-8859-15", newline='') as csvfile:
        csvfile_content = csvfile.read()
        csvfile_content_new = csvfile_content.replace('\r\n', '||')
    correctedfile_handle.write(csvfile_content_new)

(Someone suggested this, but their answer has been deleted.)
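For what it's worth, the whole round trip (protect '\r\n', split on '\n', restore '\r\n') could look roughly like the sketch below. The function name split_with_placeholder is invented here, and it assumes the placeholder never occurs in the real data, which it checks up front.

```python
def split_with_placeholder(text, placeholder='||'):
    """Split text on bare '\n' while keeping embedded '\r\n' in each line."""
    if placeholder in text:
        # Assumption: the placeholder must not collide with real data.
        raise ValueError('placeholder already occurs in the data')
    # Hide every '\r\n' so that only bare '\n' characters remain.
    protected = text.replace('\r\n', placeholder)
    lines = protected.split('\n')
    # A trailing '\n' leaves one empty string at the end; drop it.
    if lines and not lines[-1]:
        lines.pop()
    # Put the original '\r\n' sequences back.
    return [line.replace(placeholder, '\r\n') for line in lines]

sample = 'Hello\r\nmy name is Alex\nHello\r\nmy name is John\n'
print(split_with_placeholder(sample))
# ['Hello\r\nmy name is Alex', 'Hello\r\nmy name is John']
```

The text passed in should come from a file opened with newline='' as above, so that the '\r\n' sequences are still present.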

Upvotes: 1

Martijn Pieters

Reputation: 1121924

Python's line splitting algorithm can't do what you want; lines that end in \r\n also end in \n. At most you can set the newline argument to either '\n' or '' and re-join lines if they end in \r\n instead of \n. You can use a generator function to do that for you:

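def collapse_CRLF(fileobject):
    buffer = []
    for line in fileobject:
        if line.endswith('\r\n'):
            # \r\n is data, not a record separator: hold the line back
            buffer.append(line)
        else:
            # a bare \n ends the record: emit everything collected so far
            yield ''.join(buffer) + line
            buffer = []
    if buffer:
        # the file ended in \r\n; emit what is left
        yield ''.join(buffer)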

then use this as:

with open(outputfile, encoding="ISO-8859-15", newline='') as csvfile:
    lines = list(collapse_CRLF(csvfile))  # or iterate lazily with a for loop
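To get exactly the result asked for in the question (no trailing \n on each record), the same re-joining idea can be combined with stripping the final newline. The following is only a sketch: logical_lines is a hypothetical name, and io.StringIO stands in for the real file so the example runs without one.

```python
import io

def logical_lines(lines):
    """Yield records that end at a bare '\n'; '\r\n' stays inside the record."""
    pending = []
    for line in lines:
        pending.append(line)
        if not line.endswith('\r\n'):
            # Bare '\n' (or end of input): the record is complete.
            record = ''.join(pending)
            yield record[:-1] if record.endswith('\n') else record
            pending = []
    if pending:
        # Input ended right after a '\r\n'; emit the remainder unchanged.
        yield ''.join(pending)

# newline='' keeps the line endings untranslated, as in the open() call above.
sample = io.StringIO('Hello\r\nmy name is Alex\nHello\r\nmy name is John\n', newline='')
print(list(logical_lines(sample)))
# ['Hello\r\nmy name is Alex', 'Hello\r\nmy name is John']
```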

However, if this is a CSV file, then you really want to use the csv module. It already handles files with a mix of \r\n and \n endings for you, as it knows how to preserve bare newlines in RFC 4180 CSV files:

import csv

with open(outputfile, encoding="ISO-8859-15", newline='') as inputfile:
    reader = csv.reader(inputfile)

Note that in a valid CSV file, \r\n is the delimiter between rows, and \n is valid in column values. So if you did not want to use the csv module here for whatever reason, you'd still want to use newline='\r\n'.
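A small sketch of that behaviour, assuming the values that contain a newline are quoted as RFC 4180 requires (the sample data is invented, and io.StringIO stands in for the opened file):

```python
import csv
import io

# Rows are separated by \r\n; the quoted field contains a bare \n.
sample = 'id,text\r\n1,"Hello\nmy name is Alex"\r\n2,"Hello\nmy name is John"\r\n'

# newline='' hands the line endings to the csv module untranslated.
rows = list(csv.reader(io.StringIO(sample, newline='')))
print(rows)
# [['id', 'text'], ['1', 'Hello\nmy name is Alex'], ['2', 'Hello\nmy name is John']]
```

If the fields are not quoted, the csv module has no way to tell an embedded newline from a row separator, and one of the re-joining approaches above is needed instead.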

Upvotes: 0
