kevin masliah
kevin masliah

Reputation: 13

clean csv file with python


I have a csv file that i'm trying to clean up with python.
it has lines separated by << \n >> or empty lines.
I would like each line that does not end with << " >> to be cut/pasted to the previous line.
here is a concrete example to be more explicit!\

CSV FILE I HAVE

*"id","name","age","city","remark"\
"1","kevin","27","paris","This is too bad"\
"8","angel","18","london","Incredible !!!"\
"14","maria","33","madrid","i can't believe it."\
"16","john","28","new york","hey men,\
\nhow do you did this"\
"22","naima","35","istanbul","i'm sure it's false,\
\
\nit can't be real"
"35","marco","26","roma","you'r my hero!"\
"39","lili","37","tokyo","all you need to knows.\
\n\nthe best way to upgrade easely"\
...*

CSV FILE I WOULD LIKE TO HAVE

*"id","name","age","city","remark"\
"1","kevin","27","paris","This is too bad"\
"8","angel","18","london","Incredible !!!"\
"14","maria","33","madrid","i can't believe it."\
"16","john","28","new york","hey men,how do you did this"\
"22","naima","35","istanbul","i'm sure it's false, it can't be real"\
"35","marco","26","roma","you'r my hero!"\
"39","lili","37","tokyo","all you need to knows. the best way to upgrade easely"\
...*

someone would be how to do?
thank you in advance for your help !

i'm actually try this python code -->

text = open("input.csv", "r", encoding='utf-8') 
  
text = ''.join([i for i in text])  
  
text = text.replace("\\n", "")
 
x = open("output.csv","w") 
  
x.writelines(text) 
x.close()

Upvotes: 0

Views: 2592

Answers (3)

gremur
gremur

Reputation: 1690

input.csv file content:

"id","name","age","city","remark"
"1","kevin","27","paris","This is too bad"
"8","angel","18","london","Incredible !!!"
"14","maria","33","madrid","i can't believe it."
"16","john","28","new york","hey men,
how do you did this"
"22","naima","35","istanbul","i'm sure it's false,

nit can't be real"
"35","marco","26","roma","you'r my hero!"
"39","lili","37","tokyo","all you need to knows.

the best way to upgrade easely"

Possible (quick and simple) solution is the following:

with open('input.csv', 'r', encoding='utf-8') as file:
    data = file.read()
    
clean_data = data.replace('"\n"', '"||"').replace("\n", "").replace('"||"', '"\n"')
    
with open('output.csv', 'w', encoding='utf-8') as file:
    file.write(clean_data)

Returns output.csv content:

"id","name","age","city","remark"
"1","kevin","27","paris","This is too bad"
"8","angel","18","london","Incredible !!!"
"14","maria","33","madrid","i can't believe it."
"16","john","28","new york","hey men,how do you did this"
"22","naima","35","istanbul","i'm sure it's false,nit can't be real"
"35","marco","26","roma","you'r my hero!"
"39","lili","37","tokyo","all you need to knows.the best way to upgrade easely"

Upvotes: 0

Martin Evans
Martin Evans

Reputation: 46759

There are a few points to make here:

  1. Your CSV files contains , characters inside your remarks. This means the field must be enclosed in quotes (which it is).

  2. A CSV file is allowed to contain a newline within a single field. This does not result in an extra data row, but it does make the file weird to read for a human.

  3. Python's CSV reader will automatically handle newlines in fields.

  4. Finally, your data appears to be encoded strangely and you wish to remove all the extra newlines. Each row also has a trailing backslash character that should not be there.

I suggest the following approach:

  1. Use Python's CSV reader to correctly read one row at a time (you have 7 rows + a header).
  2. Remove any newlines from the remark field.

For example:

import csv

with open('input.csv') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)

    for row in csv_input:
        if len(row) == 5:       # skip blank lines
            row[4] = row[4].replace('\n', '').replace('\\n', ' ').replace('\\', '')
            csv_output.writerow(row)

This would give you:

id,name,age,city,remark\
1,kevin,27,paris,This is too bad
8,angel,18,london,Incredible !!!
14,maria,33,madrid,i can't believe it.
16,john,28,new york,"hey men, how do you did this"
22,naima,35,istanbul,"i'm sure it's false, it can't be real"
35,marco,26,roma,you'r my hero!
39,lili,37,tokyo,all you need to knows.  the best way to upgrade easely

Upvotes: 1

Lumina
Lumina

Reputation: 103

for this_row in read_file.readlines():
    if not this_row.startswith('"'):
        prev_row = prev_row.rstrip('\n') + this_row
    else:
        write_file.write(prev_row)
        prev_row = this_row

Just a draft. You can use str.join with list-cache to gain enhancement

Upvotes: 1

Related Questions