Reputation: 53
OK, so I have found part of the answer I need in the following link, and it works as long as the first column of my CSV files is in the 2015-03-01,1,2,3,1,3 format. How do I keep this working when the first column changes to 2015-03-01 00:00:00.000?
How to split a huge csv file based on content of first column?
import csv
from itertools import groupby

for key, rows in groupby(csv.reader(open("largeFile.csv", "r", encoding='utf-16')),
                         lambda row: row[0]):
    with open("%s.txt" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
So I have one large file with roughly 1.7 million rows in it...
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.02,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.02,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.02,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.02,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.03,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.03,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
And the program does create a new text document for each day, which is great!
But when the first column looks like the following, it stops working.
2015-03-01 00:00:01.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015-03-01 00:00:02.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015-03-02 00:00:01.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015-03-02 00:00:02.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015-03-02 00:00:03.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015-03-03 00:00:01.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015-03-03 00:00:02.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
It gives me the following error:
Traceback (most recent call last):
  File "C:\Python34\Proj\documents\New folder\dataPullSplit2.py", line 6, in <module>
    with open("%s.txt" % key, "w") as output:
OSError: [Errno 22] Invalid argument: '2015-03-01 00:00:00.000.txt'
Can someone point me in the right direction here, please?
Found Temp Solution
OK, so by changing "w" to "a" I am now appending to the files, and by using key[:-13]
I was able to cut the timestamp off the file name. It works, but it is SLOW. How can I improve this, and why is it going so slow?
Here is the code now:
import csv
from itertools import groupby

for key, rows in groupby(csv.reader(open("asdf2.txt", "r", encoding='utf-16')),
                         lambda row: row[0]):
    with open("%s.txt" % key[:-13], "a") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
Upvotes: 1
Views: 1406
Reputation: 85452
Assuming your files should keep the pattern 2015.01.01, cleaning the key should work:
key = key.split()[0].replace('-', '.')
Full code:
import csv
from itertools import groupby

def shorten_key(key):
    return key.split()[0].replace('-', '.')

for key, rows in groupby(csv.reader(open("asdf2.txt", "r", encoding='utf-16')),
                         lambda row: shorten_key(row[0])):
    with open("%s.txt" % shorten_key(key), "a") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
A quick test:
keys = ['2015-03-01 00:00:02.000', '2015.01.01']

for key in keys:
    print(key.split()[0].replace('-', '.'))
Output:
2015.03.01
2015.01.01
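(One more hedged variation, not part of the original answer: if any field ever contains a comma or quotes, ",".join(row) will not re-quote it, while csv.writer will. A sketch under the same assumptions about the asdf2.txt layout, wrapped in a function so the file name is a parameter:)

```python
import csv
from itertools import groupby

def shorten_key(key):
    # Keep only the date part and normalize '-' to '.', so file names
    # follow the 2015.01.01 pattern from the question.
    return key.split()[0].replace('-', '.')

def split_by_day(path, encoding="utf-16"):
    # Append each day's rows to <date>.txt; csv.writer re-quotes
    # fields correctly, unlike ",".join(row).
    with open(path, "r", encoding=encoding, newline="") as source:
        for key, rows in groupby(csv.reader(source),
                                 key=lambda row: shorten_key(row[0])):
            with open("%s.txt" % key, "a", newline="") as output:
                csv.writer(output).writerows(rows)
```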
Upvotes: 1