Sohail

Reputation: 1147

How can I split a large csv file (7GB) in Python?

I have a 7GB csv file which I'd like to split into smaller chunks so it is more readable and faster to analyze in Python on a notebook. I would like to grab a small set from it, maybe 250MB, so how can I do this?

Upvotes: 35

Views: 49246

Answers (9)

Soumil Nitin Shah

Reputation: 730

Here is my code, which might help:

import os
import uuid

import pandas as pd


class FileSettings(object):
    def __init__(self, file_name, row_size=100):
        self.file_name = file_name
        self.row_size = row_size


class FileSplitter(object):

    def __init__(self, file_settings):
        if not isinstance(file_settings, FileSettings):
            raise Exception("Please pass a FileSettings instance")
        self.file_settings = file_settings

        # Read the source file lazily, row_size rows at a time.
        self.df = pd.read_csv(self.file_settings.file_name,
                              chunksize=self.file_settings.row_size)

    def run(self, directory="temp"):
        os.makedirs(directory, exist_ok=True)

        counter = 0
        base_name = self.file_settings.file_name.split(".")[0]

        while True:
            try:
                file_name = "{}/{}_{}_row_{}_{}.csv".format(
                    directory, base_name, counter,
                    self.file_settings.row_size, uuid.uuid4()
                )
                # Write the next chunk out to its own CSV file.
                next(self.df).to_csv(file_name)
                counter += 1
            except StopIteration:
                # No more chunks left in the source file.
                break
            except Exception as e:
                print("Error:", e)
                break

        return True


def main():
    helper = FileSplitter(FileSettings(
        file_name='sample1.csv',
        row_size=10
    ))
    helper.run()


if __name__ == '__main__':
    main()

Upvotes: 1

user17870315

Reputation: 1

In case you want to split on rough byte boundaries, where the newest datapoints sit at the bottom of the file and should end up in the first output file:

from pathlib import Path

TEN_MB = 10_000_000
FIVE_MB = 5_000_000


def split_file_into_chunks(path, chunk_size=TEN_MB):
    path = str(path)
    output_prefix = path.rpartition('.')[0]
    output_ext = path.rpartition('.')[-1]

    with open(path, 'rb') as f:
        # Record the byte offset at the end of every line; remember the header.
        seek_positions = []
        for x, line in enumerate(f):
            if not x:
                header = line
            seek_positions.append(f.tell())

        # Walk backwards from the end of the file, writing a part every time at
        # least chunk_size bytes have accumulated. The newest rows go into part 0.
        part = 0
        last_seek_pos = seek_positions[-1]
        for seek_pos in reversed(seek_positions):
            if last_seek_pos - seek_pos >= chunk_size:
                with open(f'{output_prefix}.arch.{part}.{output_ext}', 'wb') as f_out:
                    f.seek(seek_pos)
                    f_out.write(header)
                    f_out.write(f.read(last_seek_pos - seek_pos))

                last_seek_pos = seek_pos
                part += 1

        # Whatever is left at the top of the file (it already starts with the
        # header) becomes the final part.
        with open(f'{output_prefix}.arch.{part}.{output_ext}', 'wb') as f_out:
            f.seek(0)
            f_out.write(f.read(last_seek_pos))

    # Replace the original file with the part containing the newest rows.
    Path(path).rename(path + '~')
    Path(f'{output_prefix}.arch.0.{output_ext}').rename(path)
    Path(path + '~').unlink()
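
For example, something like this should split a file named data.csv (placeholder name), leaving the newest rows in data.csv itself and the older rows in data.arch.1.csv, data.arch.2.csv, and so on:

# Split data.csv into roughly 10MB pieces using the function above.
split_file_into_chunks('data.csv', chunk_size=TEN_MB)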

Upvotes: 0

Powers

Reputation: 19308

This graph shows the runtimes of the different approaches outlined by other posters (on an 8-core machine, splitting a 2.9 GB file with 11.8 million rows of data into ~290 files).


The shell approach is from Thomas Orozco, the Python approach is from Roberto, the Pandas approach is from Quentin Febvre, and here's the Dask snippet:

import dask.dataframe as dd

# dtypes is a dict mapping column names to their types for this dataset
ddf = dd.read_csv("../nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2015.csv", blocksize=10000000, dtype=dtypes)
ddf.to_csv("../tmp/split_csv_dask")

I'd recommend Dask for splitting files, even though it's not the fastest, because it's the most flexible solution (you can write out different file formats, perform processing operations before writing, easily modify compression formats, etc.). The Pandas approach is almost as flexible, but cannot perform processing on the entire dataset (like sorting the entire dataset before writing).

Bash / native Python filesystem operations are clearly quicker, but that's not what I'm typically looking for when I have a large CSV. I'm typically interested in splitting large CSVs into smaller Parquet files, for performant, production data analyses. I don't usually care if the actual splitting takes a couple minutes more. I'm more interested in splitting accurately.
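
For example, the CSV-to-Parquet split can look something like this (the input path is a placeholder, and a Parquet engine such as pyarrow needs to be installed):

import dask.dataframe as dd

# Read the large CSV in ~10MB blocks and write one Parquet file per partition.
ddf = dd.read_csv("large.csv", blocksize=10000000)
ddf.to_parquet("large_parquet/")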

I wrote a blog post that discusses this in more detail. You can probably Google around and find the post.

Upvotes: 6

Roberto

Reputation: 151

Here is a little Python script I used to split a file data.csv into several CSV part files. The number of part files can be controlled with chunk_size (the number of lines per part file).

The header line (column names) of the original file is copied into every part CSV file.

It works for big files because it reads one line at a time instead of loading the complete file into memory at once.

#!/usr/bin/env python3

def main():
    chunk_size = 9998  # lines

    def write_chunk(part, lines):
        with open('data_part_'+ str(part) +'.csv', 'w') as f_out:
            f_out.write(header)
            f_out.writelines(lines)

    with open('data.csv', 'r') as f:
        count = 0
        header = f.readline()
        lines = []
        for line in f:
            count += 1
            lines.append(line)
            if count % chunk_size == 0:
                write_chunk(count // chunk_size, lines)
                lines = []
        # write remainder
        if len(lines) > 0:
            write_chunk((count // chunk_size) + 1, lines)

if __name__ == '__main__':
    main()

Upvotes: 13

Quentin Febvre

Reputation: 481

I had to do a similar task, and used the pandas package:

import pandas as pd

for i, chunk in enumerate(pd.read_csv('bigfile.csv', chunksize=500000)):
    chunk.to_csv('chunk{}.csv'.format(i), index=False)
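
If you want each part to come out near a target size on disk (the question mentions ~250MB), one rough way is to estimate the average bytes per row from a sample and derive chunksize from it; something like this (file name and numbers are just placeholders):

import pandas as pd

TARGET_BYTES = 250 * 1024 * 1024  # roughly 250MB per chunk

# Estimate the average on-disk bytes per row from the first few thousand lines.
with open('bigfile.csv', 'rb') as f:
    sample = [line for line in (f.readline() for _ in range(5000)) if line]
avg_row_bytes = sum(len(line) for line in sample) / len(sample)

chunksize = int(TARGET_BYTES / avg_row_bytes)
for i, chunk in enumerate(pd.read_csv('bigfile.csv', chunksize=chunksize)):
    chunk.to_csv('chunk{}.csv'.format(i), index=False)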

Upvotes: 33

Jimmy

Reputation: 55

I agree with @jonrsharpe: readline should be able to read one line at a time, even for big files.

If you are dealing with big csv files, might I suggest using pandas.read_csv? I often use it for the same purpose and always find it awesome (and fast). It takes a bit of time to get used to the idea of DataFrames, but once you get over that, it speeds up large operations like yours massively.
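
For example, something like this reads the file in manageable pieces (file name and chunk size are placeholders):

import pandas as pd

# Iterate over the CSV 100,000 rows at a time instead of loading it all at once;
# each chunk is an ordinary DataFrame you can filter, aggregate, or save.
for chunk in pd.read_csv('big.csv', chunksize=100000):
    print(chunk.shape)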

Hope it helps.

Upvotes: 1

dstromberg

Reputation: 7167

Maybe something like this?

#!/usr/local/cpython-3.3/bin/python

import csv

divisor = 10  # number of rows per output file

outfileno = 1
outfile = None

with open('big.csv', 'r', newline='') as infile:
    for index, row in enumerate(csv.reader(infile)):
        # Start a new output file every `divisor` rows.
        if index % divisor == 0:
            if outfile is not None:
                outfile.close()
            outfilename = 'big-{}.csv'.format(outfileno)
            outfile = open(outfilename, 'w', newline='')
            outfileno += 1
            writer = csv.writer(outfile)
        writer.writerow(row)

if outfile is not None:
    outfile.close()

Upvotes: 3

Thomas Orozco

Reputation: 55199

You don't need Python to split a csv file. Using your shell:

$ split -l 100 data.csv

This would split data.csv into chunks of 100 lines each.

Upvotes: 49

jonrsharpe

Reputation: 121974

See the Python docs on file objects (the object returned by open(filename)): you can choose to read a specified number of bytes, or use readline to work through the file one line at a time.
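
For example, something like this covers both options (data.csv stands in for the real file):

# Grab roughly the first 250MB of the file in one go (binary mode, so the
# size is counted in bytes).
with open('data.csv', 'rb') as f:
    header = f.readline()              # first line (column names)
    block = f.read(250 * 1024 * 1024)  # the next ~250MB

# Or work through the file one line at a time without loading it all.
with open('data.csv') as f:
    for line in f:
        pass  # process each line here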

Upvotes: 3
