Reputation: 561
As part of a bigger problem I am working on, I have to read in a set of .csv files, manipulate them, and generate a new set of .csv files. Everything is smooth EXCEPT for one file: voltvalues.csv. The content of the file looks like this:
...
13986513,6,6/1/2014 12:00:00 AM,248.7
13986513,6,6/1/2014 12:00:05 AM,248.4
13986513,6,6/1/2014 12:00:10 AM,249
13986513,6,6/1/2014 12:00:15 AM,249.3
13986513,6,6/1/2014 12:00:20 AM,249.3
13986513,6,6/1/2014 12:00:25 AM,249.3
...
13986513,6,6/30/2014 11:55:00 PM,249.3
13986534,6,6/1/2014 12:00:00 AM,249
13986534,6,6/1/2014 12:00:05 AM,249
13986534,6,6/1/2014 12:00:10 AM,249.3
13986534,6,6/1/2014 12:00:15 AM,249.6
...
13986534,6,6/30/2014 11:55:00 PM,249.7
...
I am trying to spit out another .csv file, newvolt.csv, that has the data in the following format:
timestamp,13986513,13986534,...
2014-06-01 12:00:00 PDT,248.7,249.3,...
2014-06-01 12:00:05 PDT,248.4,249,...
...
2014-06-30 23:55:00 PDT,249.3,249.7,...
The problem with this file is its size: voltvalues.csv is 6 GB (about 1 billion rows and 4 columns). So the way I am reading it is something like this:
import csv

#meters = []
real_recorder = open("newvolt.csv", 'w')
with open("voltvalues.csv", 'rb') as voltfile:
    voltread = csv.reader(voltfile)
    next(voltread)  # skip header
    for line in voltread:
        # convert the data of voltvalues.csv into the format I desire
        # BEST WAY to do it?
        real_recorder.writelines([...])
        #meters.append(line[0])
#print len(meters)
#print len(set(meters))
I know Python's datetime module has methods to convert one datetime format to another, but in this case it is very expensive in terms of memory. Any suggestions on the best way to make the whole conversion?
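For reference, the per-row conversion would look something like this (the format string is my inference from the sample rows above, and the PDT suffix has to be tacked on by hand):

import datetime

# one row's timestamp: "6/1/2014 12:00:05 AM" -> "2014-06-01 00:00:05 PDT"
ts = datetime.datetime.strptime("6/1/2014 12:00:05 AM", "%m/%d/%Y %I:%M:%S %p")
print(ts.strftime("%Y-%m-%d %H:%M:%S") + " PDT")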
Upvotes: 0
Views: 163
Reputation: 77407
You could scan the file and record the starting offset of each sensor's block of rows. To read the next value for a given sensor, seek to that offset, read one line, and update the offset. With this approach you don't need to keep much data in local memory, but you are depending on the operating system's RAM cache for performance. This may be a good place to use a memory-mapped file instead; a minimal sketch of that idea follows.
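Just to illustrate the mmap idea (a hedged sketch, assuming a file like the data.csv sample created further down; the offsets here are placeholders):

import mmap

# minimal sketch: memory-map the file so seek()/readline() go through
# the OS page cache instead of repeated read() system calls
with open("data.csv", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    start = mm.tell()              # remember where a block starts
    first = mm.readline()          # read one line from the mapped region
    mm.seek(start)                 # later: jump straight back to that block
    assert mm.readline() == first  # same line again
    mm.close()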
It gets more complicated if the sensors don't all report at the same timestamps, but this is a start:
# a header line is included here so that the header skip below matches
# the real voltvalues.csv (these column names are illustrative)
open('data.csv','w').write(
"""\
meter,channel,timestamp,volts
13986513,6,6/1/2014 12:00:00 AM,248.7
13986513,6,6/1/2014 12:00:05 AM,248.4
13986513,6,6/1/2014 12:00:10 AM,249
13986513,6,6/1/2014 12:00:15 AM,249.3
13986513,6,6/1/2014 12:00:20 AM,249.3
13986513,6,6/1/2014 12:00:25 AM,249.3
13986513,6,6/30/2014 11:55:00 PM,249.3
13986534,6,6/1/2014 12:00:00 AM,249
13986534,6,6/1/2014 12:00:05 AM,249
13986534,6,6/1/2014 12:00:10 AM,249.3
13986534,6,6/1/2014 12:00:15 AM,249.6
13986534,6,6/30/2014 11:55:00 PM,249.7\
""")
import datetime

class ReadSensorLines(object):
    def __init__(self, filename):
        sensor_offsets = {}
        sensors = []
        readfp = open(filename, "rb")
        readfp.readline()  # skip header
        # find the start of each sensor's block of rows;
        # use readline, not iteration, so that tell() offsets are right
        offset = readfp.tell()
        sensor = ''
        next_sensor = ''
        while True:
            line = readfp.readline()
            if not line:
                break
            next_sensor = line.split(',', 1)[0]
            if next_sensor != sensor:
                if sensor:
                    sensors.append(sensor)
                    # tell() is already past the new sensor's first line,
                    # so back up by len(line) to get the block boundary
                    next_offset = readfp.tell() - len(line)
                    sensor_offsets[sensor] = [offset, next_offset - offset]
                    sensor = next_sensor
                    offset = next_offset
                else:
                    # setup for first sensor
                    sensor = next_sensor
        if next_sensor:
            sensors.append(next_sensor)
            sensor_offsets[next_sensor] = [offset, readfp.tell() - offset]
        self.readfp = readfp
        self.sensor_offsets = sensor_offsets
        self.sensors = sensors

    def read_sensor(self, sensorname):
        # seek to this sensor's next unread line; the size argument to
        # readline() keeps us from reading past the sensor's block
        pos_data = self.sensor_offsets[sensorname]
        self.readfp.seek(pos_data[0])
        line = self.readfp.readline(pos_data[1])
        pos_data[0] += len(line)
        pos_data[1] -= len(line)
        return line

    @property
    def data_remains(self):
        # true while any sensor still has unread bytes
        return any(pos_data[1] for pos_data in self.sensor_offsets.itervalues())

    def close(self):
        self.readfp.close()

sensor_lines = ReadSensorLines("data.csv")
while sensor_lines.data_remains:
    row = []
    for sensor in sensor_lines.sensors:
        sensor_line = sensor_lines.read_sensor(sensor)
        if sensor_line:
            _, _, date, volts = sensor_line.strip().split(',')
            row.append(volts)
        else:
            row.append('')
    row.insert(0, date)
    # the timestamps look like "6/1/2014 12:00:00 AM"
    row[0] = str(datetime.datetime.strptime(row[0], '%m/%d/%Y %I:%M:%S %p'))
    print ','.join(row)
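And if you want newvolt.csv rather than printed output, the loop above can feed a csv.writer instead. A quick sketch (the output file name comes from the question; this continues the code above, so sensor_lines is assumed to exist):

import csv

# sketch: send the pivoted rows to newvolt.csv instead of stdout
outfp = open("newvolt.csv", "wb")
writer = csv.writer(outfp)
writer.writerow(["timestamp"] + sensor_lines.sensors)  # header row
# then, inside the while loop above, replace the print with:
#     writer.writerow(row)
outfp.close()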
Upvotes: 1
Reputation: 2505
It seems like the information in the file is already sorted in the right order. What you can do is:
Write the header "timestamp,13986513,13986534,..." to a file "timestamp.txt". Then grab the value "13986513,6,6/1/2014 12:00:00" into a global string, and write the line out to a file "volt.csv".
Every time the "13986513,6,6/1/2014 12:00:00" matches the previous one, you can add to it to create "2014-06-01 12:00:00 PDT,248.7,249.3,...".
But each time, read one line and write that one line away. If you keep it all in memory, the program can't handle it anymore.
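A rough sketch of that idea (hedged: it assumes rows arrive sorted by timestamp, so values for the same timestamp sit next to each other; the question's file is sorted by meter first, so the data would need a re-sort by timestamp before this works):

import csv

# sketch: collapse consecutive rows that share a timestamp into one output row
current_ts = None
row = []
with open("voltvalues.csv", "rb") as infile, open("volt.csv", "wb") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    next(reader)  # skip header
    for meter, _, ts, volts in reader:
        if ts != current_ts:
            if row:
                writer.writerow([current_ts] + row)  # finished row
            current_ts, row = ts, []
        row.append(volts)
    if row:
        writer.writerow([current_ts] + row)  # final row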
Have a look at the flush() function. I think you might need that one.
EDIT:
Example code:
class Ice():
    def __init__(self):
        self.Fire()

    def Fire(self):
        with open('File1.txt', 'r') as file1:
            for line in file1:
                # Do something.
                print('Do something....')
                # Save to file.
                with open('File2.txt', 'a') as file2:
                    file2.write(line)
                    file2.flush()
                # the with-blocks close both files automatically

# Run class
Ice()
The thing with big files is that they use a lot of memory. So what you want is to read a line from the file, process it, write it away (out of memory), and take the next line. This way you can process huge files.
What .flush() does is flush the output buffer. In my example you write away a line, but Python doesn't write it to disk at the exact moment of the write(); it stores it in a buffer. By using .flush() the output is written to the file, and the line is not kept in memory.
By creating a temporary file you can process all lines without Python using the maximum of its memory.
Upvotes: 0