Reputation: 181
I have a huge csv file with all our student rosters inside of it. So, 1) I want to separate the rosters into smaller csv files based on the course name. 2) If I can have the output csv file's name be equal to the course name (example: Algebra1.csv), that would make my life so much better. Is it possible to iterate through the courses_column of the csv file and when the name of the course changes it makes a new csv file for that course. I think I could read the keys of the dictionary 'read_rosters' and then do a while loop?
An example of the csv input file would look like this:
Here is what I have so far:
import os
import csv
path = "/PATH/TO/FILE"
with open(os.path.join(path, "student_rosters.csv"), "rU") as rosters:
read_rosters = csv.DictReader(rosters)
for row in read_rosters:
course_name = row['COURSES_COLUMN_HEADER']
csv_file = os.path.join(course_name, ".csv")
course_csv = csv.writer(open(csv_file, 'wb').next()
Upvotes: 0
Views: 3758
Reputation: 104722
In your current code, you're opening an output csv file for each line you read. This will be slow, and, as you've currently written it, it won't work. That's because using the "wb"
mode when you open the file erases everything that was in the file before. You might use an "a"
mode, but this will still be slow.
How you can best solve the problem depends a bit on your data. If you can rely upon the input always having the rows with the same course next to one another, you could use groupby
from the itertools
module to easily write the appropriate lines out together:
from itertools import groupby
from operator import itemgetter
with open(os.path.join(path, "student_rosters.csv"), "rb") as rosters:
reader = csv.DictReader(rosters)
for course, rows in groupby(reader, itemgetter('COURSES_COLUMN_HEADER')):
with open(os.path.join(path, course + ".csv"), "wb") as outfile:
writer = csv.DictWriter(outfile, reader.fieldnames)
writer.writerows(rows)
If you can't rely upon the organization of the rows, you have a couple options. One would be to read all the rows into a list, then sort them by course and use itertools.groupby
like in the code above.
Another option would be to keep reading just one line at a time, with each output row going into an appropriate file. I'd suggest keeping a dictionary of writer objects, indexed by course name. Here's what that could look like:
writers = {}
with open(os.path.join(path, "student_rosters.csv"), "rb") as rosters:
reader = csv.DictReader(rosters)
for row in reader:
course = row['COURSES_COLUMN_HEADER']
if course not in writers:
outfile = open(os.path.join(path, course + ".csv"), "wb")
writers[course] = csv.DictWriter(outfile, reader.fieldnames)
writers[course].writerow(row)
If you were using this in production, you'd probably want to add some code to close the files after you were done with them, since you can't use with
statements to close them automatically.
In my example codes above, I've made the code write out the full rows, just as they were in the input. If you don't want that, you can change the second argument to DictWriter
to a sequence of the column names you want to write. You'll also want to include the parameter extrasaction="ignore"
so that the extra values in the row dicts will be ignored when the columns you do want are written.
Upvotes: 1
Reputation: 113864
First, this is not what you want:
csv_file = os.path.join(course_name, ".csv")
It will create a file named .csv
in a subdirectory named course_name
. You likely want something like:
csv_file = os.path.join(path, course_name + ".csv")
Also, the following has two issues: (a) unbalanced parens and (b) writer
objects don't have a next
method:
course_csv = csv.writer(open(csv_file, 'wb').next()
Try instead:
course_csv = csv.writer(open(csv_file, 'wb'))
And, then you need to write something of your choosing to the new file, probably using a writeheader
, writerow
or writerows
method:
course_csv.writeheader(something_of_your_choosing)
course_csv.writerow(something_else_of_your_choosing)
Upvotes: 0