Reputation: 121
I am trying to split a csv into multiple files based on a few conditions. For instance, I have a csv as follows:
ID Timestamp Product Price
XX T1 P1 10
XX T2 P1 11
XX T2 P1 12
XX T3 P1 13
XX T3 P1 14
YY T1 P1 20
YY T1 P2 25
Expected output:
File 1: XX_P1_file1.csv
ID Timestamp Product Price
XX T1 P1. 10
XX T2 P1. 11
XX T3 P1 13
File 2: XX_P1_file2.csv
ID Timestamp Product Price
XX T2 P1 12
XX T3 P1 14
File 3: YY_P1_file1.csv
ID Timestamp Product Price
YY T1 P1 20
File 4: YY_P2_file1.csv
ID Timestamp Product Price
YY T1 P2 25
Currently, the code only looks for key(ID,Product), I want to create a condition around "Timestamp" to get the desired results and I am finding it tricky to add that. Code:
filein = open(filepath)
csvin = csv.DictReader(filein)
csv_files = {}
files = []
headers = ['ID','timestamp','product', 'price']
for row in csvin:
key = (row['ID'], row['product'])
if key not in csv_files:
# create the csv file
fileout = open('{}_{}.csv'.format(*key), 'w')
dw = csv.DictWriter(fileout, headers, extrasaction='ignore')
dw.writeheader()
csv_files[key] = dw
files.append(fileout) # to close them later
# write the line into to corresponding csv writer
csv_files[key].writerow(row)
Any help would be appreciated. Thanks!
Upvotes: 1
Views: 512
Reputation: 16673
This should do the trick with creating a “File” column with .cumcount
. This column will be used later to help create the file names dynamically and then dropped before sent to multiple dynamically named . csv files in a loop with .groupby
on the two columns necessary to group the the datasets and accompanying filenames into dynamic files. You don’t need to specify any logic for the “Product” column since the “Timestamp” resets to 1, so it would be flagged as having to go into a new file.
import pandas as pd
df = pd.read_csv('your_filename.csv')
df['File'] = df.groupby(['ID', 'Timestamp']).cumcount()+1
for (i,f), x in df.groupby(['ID', 'File']):
x.drop('File', axis=1).to_csv(f'{i}_T{f}_file{f}.csv', index=False)
df
output:
ID Timestamp Product Price File
0 XX T1 P1 10 1
1 XX T2 P1 11 1
3 XX T3 P1 13 1
ID Timestamp Product Price File
2 XX T2 P1 12 2
4 XX T3 P1 14 2
ID Timestamp Product Price File
5 YY T1 P1 20 1
ID Timestamp Product Price File
6 YY T1 P2 25 2
Upvotes: 2
Reputation: 177461
Here's a modification to your code that works. It tracks instances of ID/Product keys to direct the timestamp to the correct file. It assumes your file is already sorted by the sortkey (a requirement for itertools.groupby
) but you can pre-read and sort all the lines in with csvin=sorted(list(csv.DictReader(filein)),key=sortkey)
instead if needed.
import csv
import itertools
import operator
headers = ['ID', 'Timestamp', 'Product', 'Price']
sortkey = operator.itemgetter('ID', 'Product', 'Timestamp')
files = {}
with open('input.csv', newline='') as filein:
csvin = csv.DictReader(filein)
for (id_, product, timestamp), group in itertools.groupby(csvin, key=sortkey):
for instance, row in enumerate(group, 1):
key = id_, product, instance
if key not in files:
filename = f'{id_}_{product}_file{instance}.csv'
print(f'Starting {filename}')
fileout = open(filename, 'w', newline='')
writer = csv.DictWriter(fileout, headers)
writer.writeheader()
files[key] = fileout, writer
files[key][1].writerow(row)
print(f'Closing {len(files)} output files')
for openfile, _ in files.values():
openfile.close()
Output:
Starting XX_P1_file1.csv
Starting XX_P1_file2.csv
Starting YY_P1_file1.csv
Starting YY_P2_file1.csv
Closing 4 output files
Files match your desired output given your input.
Upvotes: 0