sweeeeeet

Reputation: 1819

Concatenating csv files nicely with Python

My program first clusters a big dataset into 100 clusters, then runs a model on each cluster using multiprocessing. My goal is to concatenate all the output values into one big csv file, i.e. the concatenation of the output data from the 100 fitted models.

For now, I am just creating 100 csv files, then looping over the folder containing these files and copying them one by one, line by line, into one big file.
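
Roughly, the copying step looks like this (a minimal sketch; the folder and file names are made up):

import glob

# copy every per-cluster csv into one big file, line by line
with open('big_output.csv', 'w') as big_file:
    for name in sorted(glob.glob('output_folder/*.csv')):
        with open(name) as part:
            for line in part:
                big_file.write(line)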

My question: is there a smarter way to get this big output file without exporting 100 intermediate files? I use pandas and scikit-learn for data processing, and multiprocessing for parallelization.

Upvotes: 0

Views: 260

Answers (3)

Rolf of Saxony

Reputation: 22443

Pinched the guts of this from http://computer-programming-forum.com/56-python/b7650ebd401d958c.htm; it's a gem.

#!/usr/bin/python
# -*- coding: utf-8 -*-
from glob import glob

file_list = glob('/home/rolf/*.csv')
print("There are {x} files to be concatenated".format(x=len(file_list)))

# context managers close each file as soon as it has been copied
with open('concatenated.csv', 'w') as concat_file:
    for n, file_name in enumerate(file_list, start=1):
        print("files added {n}".format(n=n))
        with open(file_name) as f:
            concat_file.write(f.read())

Upvotes: 1

tlastowka

Reputation: 702

Have your worker processes return the dataset to the main process rather than writing the csv files themselves; then, as they hand data back, the main process writes everything into one continuous csv.

from multiprocessing import Process, Manager

def worker_func(proc_id, results):
    # Do your thing, then hand the dataset back through the shared dict
    results[proc_id] = ["your dataset from %s" % proc_id]

def convert_dataset_to_csv(dataset):
    # Placeholder example.  I realize what it's doing is ridiculous
    converted_dataset = [','.join(data.split()) for data in dataset]
    return converted_dataset

# guard so this module imports cleanly in child processes on spawn platforms
if __name__ == '__main__':
    m = Manager()
    d_results = m.dict()

    worker_count = 100

    jobs = [Process(target=worker_func,
                    args=(proc_id, d_results))
            for proc_id in range(worker_count)]

    for j in jobs:
        j.start()

    for j in jobs:
        j.join()

    with open('somecsv.csv', 'w') as f:
        for d in d_results.values():
            # if the actual conversion function benefits from multiprocessing,
            # you can do that there too instead of here
            for r in convert_dataset_to_csv(d):
                f.write(r + '\n')
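
Since the question mentions pandas, the same idea can be sketched with a Pool of workers that return per-cluster DataFrames, which the parent concatenates once at the end; the fit_cluster function and the toy frames below are made-up stand-ins for the real model step:

from multiprocessing import Pool
import pandas as pd

def fit_cluster(cluster_df):
    # Stand-in for the real per-cluster model fit; just passes data through.
    return cluster_df

if __name__ == '__main__':
    # Stand-ins for the 100 per-cluster DataFrames.
    clusters = [pd.DataFrame({'cluster': [i], 'value': [2 * i]})
                for i in range(100)]
    with Pool() as pool:
        outputs = pool.map(fit_cluster, clusters)
    # One concat, one write: no intermediate csv files on disk.
    pd.concat(outputs, ignore_index=True).to_csv('somecsv.csv', index=False)

Note that the per-cluster frames get pickled between processes, so this trades disk i/o for inter-process transfer.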

Upvotes: 1

Matt Anderson

Reputation: 19779

If all of your partial csv files have no headers and share the same columns in the same order, you can concatenate them like this:

with open("unified.csv", "w") as unified_csv_file:
    for partial_csv_name in partial_csv_names:
        with open(partial_csv_name) as partial_csv_file:
            unified_csv_file.write(partial_csv_file.read())
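
If the partial files do carry a header row, a small variation (a sketch, using the same assumed partial_csv_names list) keeps the header from the first file and skips it in the rest:

with open("unified.csv", "w") as unified_csv_file:
    for i, partial_csv_name in enumerate(partial_csv_names):
        with open(partial_csv_name) as partial_csv_file:
            if i > 0:
                next(partial_csv_file)  # drop this file's header line
            unified_csv_file.write(partial_csv_file.read())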

Upvotes: 1
