Christopher Zion Hall
Christopher Zion Hall

Reputation: 53

Split a large csv file based on date in First column Python 3.4.3

Ok, So I have found part of the answer I need here in the following Link and it is working as long as my csv files are in a 2015-03-01,1,2,3,1,3 format for the first column. How do I keep this working when the first column is changes to 2015-03-01 00:00:00.000

How to split a huge csv file based on content of first column?

import csv
from itertools import groupby

for key, rows in groupby(csv.reader(open("largeFile.csv", "r", encoding='utf-16')),
                     lambda row: row[0]):
with open("%s.txt" % key, "w") as output:
    for row in rows:
        output.write(",".join(row) + "\n")

So I have one large file with roughly 1.7million rows in it...

2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.02,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.02,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.02,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.02,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.03,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.03,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1

And the program does create a new text document for each day which is great!

But when the columns are as follows, it stops working.

2015-03-01 00:00:01.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1

2015-03-01 00:00:02.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1

2015-03-02 00:00:01.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1

2015-03-02 00:00:02.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1

2015-03-02 00:00:03.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1

2015-03-03 00:00:01.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1

2015-03-03 00:00:02.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1

and it gives me the following error.

Traceback (most recent call last): File "C:\Python34\Proj\documents\New folder\dataPullSplit2.py", line 6, in with open("%s.txt" % key, "w") as output: OSError: [Errno 22] Invalid argument: '2015-03-01 00:00:00.000.txt'

Can someone point me in the Right Direction here please.

Found Temp Solution

OK, so by changing it from "w" to "a" I am now appending to the files and by using key[:-13] i was able to cut off the time stamp on the file name... it works... but it is SLOW... how can i improve this and understand why it is going so slow?

Here is the code now

import csv
from itertools import groupby

for key, rows in groupby(csv.reader(open("asdf2.txt", "r", encoding='utf-16')),
                     lambda row: row[0]):

with open("%s.txt" % key[:-13], "a") as output:
    for row in rows:
        output.write(",".join(row) + "\n")

Upvotes: 1

Views: 1406

Answers (1)

Mike Müller
Mike Müller

Reputation: 85452

Assuming your files should keep the pattern 2015.01.01, cleaning the key should work:

key = key.split()[0].replace('-', '.')

Full code:

import csv
from itertools import groupby


def shorten_key(key):
    return key.split()[0].replace('-', '.')


for key, rows in groupby(csv.reader(open("asdf2.txt", "r", encoding='utf-16')),
                         lambda row: shorten_key(row[0])):

    with open("%s.txt" % shorten_key(key), "a") as output:
        for row in rows:
            output.write(",".join(row) + "\n")

A quick test:

keys = ['2015-03-01 00:00:02.000',  '2015.01.01']

for key in keys:
    print(key.split()[0].replace('-', '.'))

Output:

2015.03.01
2015.01.01

Upvotes: 1

Related Questions