Siddhartha
Siddhartha

Reputation: 53

Text qualifiers getting misplaced while splitting csv file using Python

I am trying to split large csv files in smaller chunks and load the data to sql for further analyse on the chunks. But when I am running the following code, the text qualifiers are getting misplaced and it is hampering the csv file and we are unable to load the data:

import csv

divisor = 500000

outfileno = 1
outfile = None

with open('mock_data.txt', 'r') as infile:
    infile_iter = csv.reader(infile)
    header = next(infile_iter)
    for index, row in enumerate(infile_iter):
        if index % divisor == 0:
            if outfile is not None:
                outfile.close()
            outfilename = 'big-{}.csv'.format(outfileno)
            outfile = open(outfilename, 'w')
            outfileno += 1
            writer = csv.writer(outfile)
            writer.writerow(header)
        writer.writerow(row)
    # Don't forget to close the last file
    if outfile is not None:
        outfile.close()

Even though the script is running properly for smaller mock_data sets(below 1000 lines), it is not working properly for large datasets. Lets say the dataset is as follows:

"col1"  "col2"  "col3"  "col4"
"100"   "0100"  "4900236731"    "2019"
"100"   "0100"  "4900236731"    "2019"
"100"   "0100"  "4900236731"    "2019"

When I am running the script the smaller chunks are getting generated like following:

"col1   ""col2""    ""col3""    ""col4"""
"100    ""0100""    ""4900236731""  ""2019"""
"100    ""0100""    ""4900236731""  ""2019"""
"100    ""0100""    ""4900236731""  ""2019"""

The text qualifiers are getting misplaced. Any solution? Please note: I have tried using other codes to split the data but the issue is same for other codes and data.

Upvotes: 0

Views: 558

Answers (1)

Martin Evans
Martin Evans

Reputation: 46779

In Python 3.x, you should open your CSV files using the parameter newline=''. A tab delimiter can be specified using delimiter='\t'. For example:

import csv

divisor = 500000
outfileno = 1
outfile = None

with open('mock_data.txt', 'r', newline='') as infile:
    infile_iter = csv.reader(infile, delimiter='\t')
    header = next(infile_iter)

    for index, row in enumerate(infile_iter):
        if index % divisor == 0:
            if outfile:
                outfile.close()

            outfilename = 'big-{}.csv'.format(outfileno)
            outfile = open(outfilename, 'w', newline='')
            outfileno += 1
            writer = csv.writer(outfile, delimiter='\t', quoting=csv.QUOTE_ALL)
            writer.writerow(header)

        writer.writerow(row)

    # Don't forget to close the last file
    if outfile:
        outfile.close()

To force quoting on all of your fields, use quoting=csv.QUOTE_ALL. This would then give you an output as follows, where all fields are double quoted, and are separated with a tab character:

"col1"  "col2"  "col3"  "col4"
"100"   "0100"  "4900236731"    "2019"
"100"   "0100"  "4900236731"    "2019"
"100"   "0100"  "4900236731"    "2019"

This can be verified by opening the file with a text editor. If the data does not look as expected, it implies an issue with your mock_data.txt file. You would need to provide a link to a smaller sample from it that recreates the problem.

Upvotes: 1

Related Questions