Reputation: 261
With the help of the Python community I started learning Python to process around 500 million records (40 GB) of data, and I wrote the following script.
Input File Format -
Studentid,Subject,DateTime,Grade
001,Biology,Mon Apr 25 19:32:00 PDT 2013,B
001,Literature,Wed Apr 10 15:31:00 PST 2013,B
001,Math,Mon Apr 22 01:32:00 PDT 2013,A
002,Biology,Mon Apr 25 19:32:00 PDT 2013,A
002,Math,Mon Apr 22 16:31:14 PDT 2013,C
002,Math,Wed Apr 10 15:31:00 PST 2013,C
003,Biology,Mon Apr 22 13:31:00 PDT 2013,A
003,Irdu,Wed Apr 10 15:31:00 PST 2013,A
Output Report
003,Irdu;Wed Apr 10 15:31:00 PST 2013;A#Biology;Mon Apr 22 13:31:00 PDT 2013;A
002,Math;Wed Apr 10 15:31:00 PST 2013;C#Math;Mon Apr 22 16:31:14 PDT 2013;C#Biology;Mon Apr 25 19:32:00 PDT 2013;A
001,Literature;Wed Apr 10 15:31:00 PST 2013;B#Math;Mon Apr 22 01:32:00 PDT 2013;A#Biology;Mon Apr 25 19:32:00 PDT 2013;B
Python Script
import csv
import time
import operator
import sys, getopt
import os
from collections import defaultdict
from datetime import datetime
from operator import itemgetter

start = time.time()

def elapsed():
    return time.time() - start

# Sort key: parse the timestamp after stripping the PDT/PST timezone token.
def date_key(row):
    try:
        formatRow = row[1].replace('PDT ','')
        formatRow = formatRow.replace('PST ','')
        return datetime.strptime(formatRow, "%a %b %d %X %Y")
    except Exception, e:
        print ("Error in sorting the date: %s \nRow : %s" % (e, row))
        pass

# Sort each student's records by date and append one report line per student.
def processRecords(accountsData, fileName):
    for v in accountsData.itervalues():
        try:
            v.sort(key=date_key)
        except Exception, e:
            pass
    with open(fileName, 'a') as writer:
        for pid, v in accountsData.iteritems():
            csv = '#'.join([';'.join(t) for t in v])
            writer.write("%s,%s\n" % (pid, csv))

def main(argv):
    inputFile = ''
    outputFile = ''
    batchsize = 20000000
    try:
        opts, args = getopt.getopt(argv, "hi:o:b:", ["ifile=", "ofile=", "bsize="])
    except getopt.GetoptError:
        print 'ReportToFileBatches.py -i <inputfile> -o <outputfile> -b <batchsize>[default=20000000]'
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print 'ReportToFileBatches.py -i <inputfile> -o <outputfile> -b <batchsize>[default=20000000]'
            sys.exit()
        elif opt in ("-i", "--ifile"):
            inputFile = arg
        elif opt in ("-o", "--ofile"):
            outputFile = arg
        elif opt in ("-b", "--bsize"):
            batchsize = int(arg)
    if not (os.path.isfile(inputFile)):
        print ("\nError : File - %s does not exist." % (inputFile))
        sys.exit(2)
    #print "Batch Size %s " % batchsize
    linenumb = 0
    with open(inputFile, 'r') as data:
        accounts = defaultdict(list)
        for line in data:
            linenumb = linenumb + 1
            line = line.rstrip('\r\n')
            try:
                sid, subject, datetime, grade = line.split(',')
                accounts[sid].append((subject, datetime, grade))
                if (linenumb == batchsize):
                    # Batch is full: write it out and start a fresh one.
                    linenumb = 0
                    processRecords(accounts, outputFile)
                    accounts = defaultdict(list)
                else: continue
            except Exception, e:
                print ("Error : %s \nRow : %s" % (e, line))
    # Flush whatever is left from the last, partial batch.
    if (linenumb > 0):
        processRecords(accounts, outputFile)
    print("Total time taken - %.3fs" % elapsed())

if __name__ == "__main__":
    main(sys.argv[1:])
As you can see, in the output file (the report) each student's records are ordered by date and the fields are concatenated. I suspect most of the time is spent sorting on the datetime column. I am a newcomer to Python and would really appreciate any help in improving this script to reduce the processing time. I hope I am making sense.
FYI: I am making sure the input file is sorted by Studentid, and I am processing it in batches.
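To clarify what I mean by that: because the file is sorted by Studentid, all rows for one student are adjacent. This is just a sketch of that idea (not part of the script above), using the standard itertools.groupby:

import itertools

def stream_students(path):
    # Relies on the input already being sorted by Studentid: all rows for a
    # student are adjacent, so they arrive as a single group.
    with open(path) as data:
        rows = (line.rstrip('\r\n').split(',') for line in data)
        for sid, group in itertools.groupby(rows, key=lambda r: r[0]):
            yield sid, [tuple(r[1:]) for r in group]

Each yielded group is one student's records, so only one student needs to be held in memory at a time.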
Upvotes: 1
Views: 149
Reputation: 5877
I can't even imagine wanting to do this with any amount of profiling or algorithm optimization.
This strikes me as a database problem.
Python is the glue language to do the formatting and parsing, not the language to roll your own management of 40 GB of data.
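For illustration only, a minimal sketch of that approach using the standard sqlite3 module (the file names, table name and column names here are my own assumptions, not something from the question). The timestamps are converted to a sortable ISO form at load time so the database can do the grouping and sorting, and Python only formats the report:

import itertools
import sqlite3
from datetime import datetime

def to_sortable(raw):
    # Strip the timezone token and convert to an ISO string that sorts chronologically.
    cleaned = raw.replace('PDT ', '').replace('PST ', '')
    return datetime.strptime(cleaned, '%a %b %d %H:%M:%S %Y').isoformat()

conn = sqlite3.connect('grades.db')
conn.execute('CREATE TABLE IF NOT EXISTS grades '
             '(studentid TEXT, subject TEXT, taken_at TEXT, raw_date TEXT, grade TEXT)')

with open('input.csv') as f:  # hypothetical input path
    rows = (line.rstrip('\r\n').split(',') for line in f)
    conn.executemany('INSERT INTO grades VALUES (?, ?, ?, ?, ?)',
                     ((sid, subj, to_sortable(raw), raw, grade)
                      for sid, subj, raw, grade in rows))
conn.commit()

# The database returns the rows already grouped by student and ordered by date.
cursor = conn.execute('SELECT studentid, subject, raw_date, grade '
                      'FROM grades ORDER BY studentid, taken_at')
with open('report.txt', 'w') as out:
    for sid, group in itertools.groupby(cursor, key=lambda r: r[0]):
        out.write('%s,%s\n' % (sid, '#'.join(';'.join(r[1:]) for r in group)))

An index on (studentid, taken_at) would be worth adding before running the final query.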
Upvotes: 1
Reputation: 858
I think the best thing you can do is a pigeonhole sort. This works if you know the bounds and spread of your time data, i.e. roughly the smallest and largest timestamps. Pigeonholing works by segregating your data into discrete pigeonholes. For example, if your data is spread across a month, you could have an array in which every index is a sublist representing one hour of one day of that month. You walk the list, placing each record into the matching sublist, and then sort within each sublist.
This algorithm is very efficient only if you choose your pigeonholes properly. For example, if all your data fell within the same day and your pigeonholes were per day, this algorithm would perform no better than yours. Another note is that your pigeonholes don't have to be evenly spread out. If almost all your data falls in the same hour on different days, use smaller time steps inside that hour and bigger ones for the rest.
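A rough sketch of the idea (the one-hour bucket width and the helper name are my own, and key would be something like the question's date_key):

def pigeonhole_sort(rows, window_start, window_end, key):
    # One bucket per hour between window_start and window_end (datetime objects).
    hours = int((window_end - window_start).total_seconds() // 3600) + 1
    buckets = [[] for _ in range(hours)]
    for row in rows:
        offset = int((key(row) - window_start).total_seconds() // 3600)
        buckets[offset].append(row)
    ordered = []
    for bucket in buckets:
        bucket.sort(key=key)  # each bucket is small, so this sort is cheap
        ordered.extend(bucket)
    return ordered

Uneven pigeonholes, as described above, just mean computing the bucket index from a list of boundaries instead of a fixed hour width.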
Counting sort is another option (it also has a better runtime), but since your file is so big I think you might run into memory issues or slowdowns from constant disk writes. It is worth looking into, but be careful with those delays.
EDIT: I just realized that your file is 40 GB, so to account for that memory issue, instead of an array of arrays you could use a series of files with known names that contain your pigeonholed lists.
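A sketch of that file-based variant, with one file per day (the directory layout, file naming and per-day granularity are my assumptions):

import os
from datetime import datetime

def partition_by_day(input_path, out_dir):
    # First pass: append each line to a per-day file so that no single day
    # has to fit in memory; each day file can then be sorted on its own.
    handles = {}
    try:
        with open(input_path) as data:
            for line in data:
                raw = line.rstrip('\r\n').split(',')[2]
                cleaned = raw.replace('PDT ', '').replace('PST ', '')
                day = datetime.strptime(cleaned, '%a %b %d %H:%M:%S %Y').strftime('%Y-%m-%d')
                if day not in handles:
                    handles[day] = open(os.path.join(out_dir, day + '.csv'), 'a')
                handles[day].write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)  # the day keys, in chronological order

The second pass (not shown) would load each much smaller day file, sort it in memory, and append its rows to the final report.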
Upvotes: 0