Reputation: 365

pandas - newline char splitting row in multiple rows while reading and writing csv

My task is to read a CSV file from one location, manipulate it in memory as a DataFrame, and then write the file to another location.

The source file is '||'-separated, and the target file has to be ','-separated.

I have to do this for multiple files with different columns.

In one of the source CSVs, one of the columns contains newline characters within the column value.

example source CSV file:

id||notes<CR><LF>
1||notesLine1<CR><LF>
2||notesLine1<CR><LF>
notesLine2<CR><LF>
3||notesLine1: notesLine2<CR><LF>

Note that the line separator is <CR><LF>, and the newline characters within the 'notes' column are also <CR><LF>. I cannot change the source; however, I can add an intermediate layer in memory or on disk if any modification is required.

code:

...
df_target = pd.read_csv(source_file, dtype=None, parse_dates=True, keep_default_na=False, header=None, sep=r"\|\|", engine='python', encoding='utf-8')

df_target.to_csv(target_file,header=header_list,index=False,quoting=csv.QUOTE_ALL)
...

current output:

"id","notes"<CR><LF>
"1","notesLine1"<CR><LF>
"2","notesLine1"<CR><LF>
"notesLine2",""<CR><LF>      -- extra unwanted row being created
"3","notesLine1: notesLine2"<CR><LF>

Note that the row is split in two, making 4 rows in total. I don't want this to happen!

expected output:

"id","notes"<CR><LF>
"1","notesLine1"<CR><LF>
"2","notesLine1 \n notesLine2"<CR><LF>
"3","notesLine1: notesLine2"<CR><LF>

Note: instead of the row being split in two, I can accept a '\n' with the data kept in the same row, so that there are 3 rows in total and not 4.

Is there a way that this can be handled?

Upvotes: 0

Answers (2)

Subasri sridhar

Reputation: 831

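See if this helps:

with open("sample.csv", "r+") as file:
    records = []
    for line in file:
        line = line.strip()
        if line[:1].isdigit() or not records:
            # A leading digit (the id) marks the start of a new record;
            # the first line (the header) is always kept as its own record.
            records.append(line)
        elif line:
            # Any other non-empty line continues the previous record's notes.
            records[-1] += " " + line
    file.seek(0)
    file.write("\n".join(records))
    file.truncate()  # drop leftover bytes, since the merged text is shorter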
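
If rewriting the source file in place is not an option, the same idea can be done in memory. Below is a minimal sketch using only the standard library. It assumes that a continuation line never contains the '||' delimiter, which holds for the sample in the question but may not hold for every file. The names merge_continuation_lines and convert are hypothetical helpers made up for this illustration, not pandas functions.

```python
import csv
import io


def merge_continuation_lines(raw_text, sep="||", joiner=" \\n "):
    """Fold every line that lacks `sep` into the record before it.

    Assumption: a line without the delimiter is the tail of a multi-line
    field. `joiner` defaults to a literal backslash-n surrounded by spaces.
    """
    records = []
    for line in raw_text.splitlines():  # treats <CR><LF> as one line break
        if sep in line or not records:
            records.append(line)
        elif line:
            records[-1] += joiner + line
    return records


def convert(raw_text, sep="||"):
    """Return the merged records as fully quoted, comma-separated CSV text."""
    rows = [rec.split(sep) for rec in merge_continuation_lines(raw_text, sep)]
    out = io.StringIO(newline="")
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\r\n")
    writer.writerows(rows)
    return out.getvalue()


sample = (
    "id||notes\r\n"
    "1||notesLine1\r\n"
    "2||notesLine1\r\n"
    "notesLine2\r\n"
    "3||notesLine1: notesLine2\r\n"
)
result = convert(sample)
```

For the sample above, the record with id 2 stays on one row with a literal \n between the two note lines. If the DataFrame manipulation is still needed, the merged records can be joined with newlines and handed to pd.read_csv through io.StringIO instead of a file path.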


Upvotes: 0

Subasri sridhar

Reputation: 831

CR and LF are control characters, respectively coded 0x0D (13 decimal) and 0x0A (10 decimal).

They are used to mark a line break in the file.
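
A quick way to check those values in Python (a small sketch; the variable names are only for illustration):

```python
cr, lf = "\r", "\n"

# Code points of the two control characters.
codes = (ord(cr), ord(lf))

# The <CR><LF> pair as raw bytes.
crlf_bytes = (cr + lf).encode("ascii")

# str.splitlines() treats a <CR><LF> pair as a single line break.
lines = "a\r\nb\r\n".splitlines()
```

Here codes is (13, 10), crlf_bytes is the two bytes 0x0D 0x0A, and lines is ['a', 'b'].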

Upvotes: 1
