Reputation: 53
I am trying to split large CSV files into smaller chunks and load the data into SQL for further analysis on the chunks. But when I run the following code, the text qualifiers get misplaced, which corrupts the CSV files so we are unable to load the data:
import csv

divisor = 500000
outfileno = 1
outfile = None

with open('mock_data.txt', 'r') as infile:
    infile_iter = csv.reader(infile)
    header = next(infile_iter)
    for index, row in enumerate(infile_iter):
        if index % divisor == 0:
            if outfile is not None:
                outfile.close()
            outfilename = 'big-{}.csv'.format(outfileno)
            outfile = open(outfilename, 'w')
            outfileno += 1
            writer = csv.writer(outfile)
            writer.writerow(header)
        writer.writerow(row)

# Don't forget to close the last file
if outfile is not None:
    outfile.close()
Even though the script runs properly for smaller mock data sets (below 1,000 lines), it does not work properly for large datasets. Let's say the dataset is as follows:
"col1" "col2" "col3" "col4"
"100" "0100" "4900236731" "2019"
"100" "0100" "4900236731" "2019"
"100" "0100" "4900236731" "2019"
When I run the script, the smaller chunks are generated like the following:
"col1 ""col2"" ""col3"" ""col4"""
"100 ""0100"" ""4900236731"" ""2019"""
"100 ""0100"" ""4900236731"" ""2019"""
"100 ""0100"" ""4900236731"" ""2019"""
The text qualifiers are getting misplaced. Any solution? Please note: I have tried other code to split the data, but the issue is the same for other code and other data.
Upvotes: 0
Views: 558
Reputation: 46779
In Python 3.x, you should open your CSV files using the parameter newline=''. A tab delimiter can be specified using delimiter='\t'. For example:
import csv

divisor = 500000
outfileno = 1
outfile = None

with open('mock_data.txt', 'r', newline='') as infile:
    infile_iter = csv.reader(infile, delimiter='\t')
    header = next(infile_iter)
    for index, row in enumerate(infile_iter):
        if index % divisor == 0:
            if outfile:
                outfile.close()
            outfilename = 'big-{}.csv'.format(outfileno)
            outfile = open(outfilename, 'w', newline='')
            outfileno += 1
            writer = csv.writer(outfile, delimiter='\t', quoting=csv.QUOTE_ALL)
            writer.writerow(header)
        writer.writerow(row)

# Don't forget to close the last file
if outfile:
    outfile.close()
To force quoting on all of your fields, use quoting=csv.QUOTE_ALL. This would then give you an output as follows, where all fields are double quoted and separated with a tab character:
"col1" "col2" "col3" "col4"
"100" "0100" "4900236731" "2019"
"100" "0100" "4900236731" "2019"
"100" "0100" "4900236731" "2019"
This can be verified by opening the file with a text editor. If the data does not look as expected, it implies an issue with your mock_data.txt file. You would need to provide a link to a smaller sample from it that recreates the problem.
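For what it's worth, the mangled output in the question is exactly what happens when tab-delimited quoted data is read with the default comma delimiter: the whole line collapses into one field containing literal quote characters, which the writer then escapes by doubling them. A minimal reproduction, using a shortened sample line:

```python
import csv
import io

line = '"100"\t"0100"\t"2019"'  # tab-delimited, quoted input

# Read with the DEFAULT comma delimiter - the tabs are not treated as
# separators, so the whole line becomes a single field with embedded quotes
row = next(csv.reader([line]))
print(row)  # ['100\t"0100"\t"2019"']

# Writing that one field back quotes it and doubles the inner quotes,
# producing the corrupted chunks shown in the question
buf = io.StringIO()
csv.writer(buf).writerow(row)
print(repr(buf.getvalue()))  # '"100\t""0100""\t""2019"""\r\n'
```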
Upvotes: 1