Splitting a csv into multiple csvs

Question

I am trying to split a csv into multiple files based on a few conditions. For instance, I have a csv as follows:

ID    Timestamp  Product  Price
XX      T1         P1       10  
XX      T2         P1       11
XX      T2         P1       12
XX      T3         P1       13
XX      T3         P1       14
YY      T1         P1       20
YY      T1         P2       25

Expected output:

File 1: XX_P1_file1.csv

ID    Timestamp  Product  Price
XX      T1         P1       10
XX      T2         P1       11
XX      T3         P1       13

File 2: XX_P1_file2.csv

ID    Timestamp  Product  Price
XX      T2         P1       12
XX      T3         P1       14

File 3: YY_P1_file1.csv

ID    Timestamp  Product  Price
YY      T1         P1       20

File 4: YY_P2_file1.csv

ID    Timestamp  Product  Price
YY      T1         P2       25

Currently, my code only groups rows by the key (ID, Product). I want to add a condition on "Timestamp" to get the desired result, but I am finding that tricky. Code:

    import csv

    filein = open(filepath, newline='')
    csvin = csv.DictReader(filein)
    csv_files = {}
    files = []
    # column names must match the CSV header exactly (case-sensitive)
    headers = ['ID', 'Timestamp', 'Product', 'Price']

    for row in csvin:
        key = (row['ID'], row['Product'])
        if key not in csv_files:
            # create the csv file
            fileout = open('{}_{}.csv'.format(*key), 'w', newline='')
            dw = csv.DictWriter(fileout, headers, extrasaction='ignore')
            dw.writeheader()
            csv_files[key] = dw
            files.append(fileout)  # to close them later

        # write the row to the corresponding csv writer
        csv_files[key].writerow(row)

Any help would be appreciated. Thanks!

Mark Tolonen · Accepted Answer

Here's a modification to your code that works. It groups rows by (ID, Product, Timestamp) and numbers the rows within each group, so the n-th row sharing a timestamp goes to file n for that ID/Product pair. It assumes your file is already sorted by sortkey (a requirement for itertools.groupby); if it is not, you can read and sort all the rows first with csvin = sorted(csv.DictReader(filein), key=sortkey).

import csv
import itertools
import operator

headers = ['ID', 'Timestamp', 'Product', 'Price']
sortkey = operator.itemgetter('ID', 'Product', 'Timestamp')
files = {}  # (ID, Product, instance) -> (open file, DictWriter)

with open('input.csv', newline='') as filein:
    csvin = csv.DictReader(filein)
    # Each group holds the consecutive rows that share one (ID, Product, Timestamp).
    for (id_, product, timestamp), group in itertools.groupby(csvin, key=sortkey):
        # The n-th row of a group goes to file n for that ID/Product pair.
        for instance, row in enumerate(group, 1):
            key = id_, product, instance
            if key not in files:
                filename = f'{id_}_{product}_file{instance}.csv'
                print(f'Starting {filename}')
                fileout = open(filename, 'w', newline='')
                writer = csv.DictWriter(fileout, headers)
                writer.writeheader()
                files[key] = fileout, writer
            files[key][1].writerow(row)

# Output files stay open across groups, so close them all at the end.
print(f'Closing {len(files)} output files')
for openfile, _ in files.values():
    openfile.close()

Output:

Starting XX_P1_file1.csv
Starting XX_P1_file2.csv
Starting YY_P1_file1.csv
Starting YY_P2_file1.csv
Closing 4 output files

The resulting files match your desired output for the given input.
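
If the input is not already sorted, the pre-sort mentioned above could look roughly like the sketch below. It is only an illustration: the helper name plan_split and the shuffled SAMPLE data are made up for this example, and instead of writing files it builds a dict of filename to rows so the routing is easy to inspect. It also assumes the Timestamp strings sort correctly as plain text (true for T1, T2, T3; real timestamps may need parsing first).

```python
import csv
import io
import itertools
import operator

# The question's rows, deliberately shuffled so they are NOT sorted.
SAMPLE = """\
ID,Timestamp,Product,Price
XX,T3,P1,13
YY,T1,P2,25
XX,T1,P1,10
XX,T2,P1,11
YY,T1,P1,20
XX,T2,P1,12
XX,T3,P1,14
"""

sortkey = operator.itemgetter('ID', 'Product', 'Timestamp')


def plan_split(lines):
    """Map each output filename to the rows it would receive (hypothetical helper)."""
    # sorted() reads every row into memory. It is stable, so rows that
    # share a key keep their original relative order.
    rows = sorted(csv.DictReader(lines), key=sortkey)
    plan = {}
    for (id_, product, _), group in itertools.groupby(rows, key=sortkey):
        # The n-th row with a given (ID, Product, Timestamp) goes to file n.
        for instance, row in enumerate(group, 1):
            filename = f'{id_}_{product}_file{instance}.csv'
            plan.setdefault(filename, []).append(row)
    return plan


plan = plan_split(io.StringIO(SAMPLE))
for filename, rows in plan.items():
    print(filename, [row['Price'] for row in rows])
```

Because the whole file is held in memory for the sort, this trades memory for not having to rely on the input order; for very large inputs, sorting the file beforehand and streaming it as in the code above may be the better fit.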
