Reputation: 561
As part of a bigger problem I am working on, I have to read in a set of .csv files, manipulate them, and generate a new set of .csv files. Everything is smooth EXCEPT for one file: voltvalues.csv. The content of the file looks like this:
...
13986513,6,6/1/2014 12:00:00 AM,248.7
13986513,6,6/1/2014 12:00:05 AM,248.4
13986513,6,6/1/2014 12:00:10 AM,249
13986513,6,6/1/2014 12:00:15 AM,249.3
13986513,6,6/1/2014 12:00:20 AM,249.3
13986513,6,6/1/2014 12:00:25 AM,249.3
...
13986513,6,6/30/2014 11:55:00 PM,249.3
13986534,6,6/1/2014 12:00:00 AM,249
13986534,6,6/1/2014 12:00:05 AM,249
13986534,6,6/1/2014 12:00:10 AM,249.3
13986534,6,6/1/2014 12:00:15 AM,249.6
...
13986534,6,6/30/2014 11:55:00 PM,249.7
...
I am trying to spit out another .csv file, newvolt.csv, that has the data in the following format:
timestamp,13986513,13986534,...
2014-06-01 12:00:00 PDT,248.7,249.3,...
2014-06-01 12:00:05 PDT,248.4,249,...
...
2014-06-30 23:55:00 PDT,249.3,249.7,...
The problem with this file is its size: voltvalues.csv is 6 GB (about 1 billion rows and 4 columns). So the way I am reading it is something like this:
import csv

#meters = []
real_recorder = open("newvolt.csv", 'w')
with open("voltvalues.csv", 'rb') as voltfile:
    voltread = csv.reader(voltfile)
    next(voltread)  # skip header
    for line in voltread:
        # convert the data of voltvalues.csv into the format I desire
        # BEST WAY to do it?
        real_recorder.writelines([...])
        #meters.append(line[0])
#print len(meters)
#print len(set(meters))
I know Python's datetime module has methods to convert one datetime format to another, but in this case it is very expensive in terms of memory. Any suggestions on the best way to make the whole conversion?
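For reference, the per-row conversion would look something like this (the format string is my inference from the sample rows above, and the PDT suffix has to be tacked on by hand):

import datetime

# one row's timestamp: "6/1/2014 12:00:05 AM" -> "2014-06-01 00:00:05 PDT"
ts = datetime.datetime.strptime("6/1/2014 12:00:05 AM", "%m/%d/%Y %I:%M:%S %p")
print(ts.strftime("%Y-%m-%d %H:%M:%S") + " PDT")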
Upvotes: 0
Views: 163
Reputation: 77407
You could scan the file and record the starting offset of each sensor's block of rows. To read the next value for a given sensor, seek to that offset, read one line, and update the offset. With this approach you don't need to keep much data in local memory, but you are depending on the operating system's RAM cache for performance. This may be a good place to use a memory-mapped file instead; a minimal sketch of that idea follows.
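Just to illustrate the mmap idea (a hedged sketch, assuming a file like the data.csv sample created further down; the offsets here are placeholders):

import mmap

# minimal sketch: memory-map the file so seek()/readline() go through
# the OS page cache instead of repeated read() system calls
with open("data.csv", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    start = mm.tell()              # remember where a block starts
    first = mm.readline()          # read one line from the mapped region
    mm.seek(start)                 # later: jump straight back to that block
    assert mm.readline() == first  # same line again
    mm.close()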
It gets more complicated if the sensors don't all report at the same timestamps, but this is a start:
# a header line is included here so that the header skip below matches
# the real voltvalues.csv (these column names are illustrative)
open('data.csv','w').write(
"""\
meter,channel,timestamp,volts
13986513,6,6/1/2014 12:00:00 AM,248.7
13986513,6,6/1/2014 12:00:05 AM,248.4
13986513,6,6/1/2014 12:00:10 AM,249
13986513,6,6/1/2014 12:00:15 AM,249.3
13986513,6,6/1/2014 12:00:20 AM,249.3
13986513,6,6/1/2014 12:00:25 AM,249.3
13986513,6,6/30/2014 11:55:00 PM,249.3
13986534,6,6/1/2014 12:00:00 AM,249
13986534,6,6/1/2014 12:00:05 AM,249
13986534,6,6/1/2014 12:00:10 AM,249.3
13986534,6,6/1/2014 12:00:15 AM,249.6
13986534,6,6/30/2014 11:55:00 PM,249.7\
""")
import datetime

class ReadSensorLines(object):
    def __init__(self, filename):
        sensor_offsets = {}
        sensors = []
        readfp = open(filename, "rb")
        readfp.readline()  # skip header
        # find the start of each sensor's block of rows;
        # use readline, not iteration, so that tell() offsets are right
        offset = readfp.tell()
        sensor = ''
        next_sensor = ''
        while True:
            line = readfp.readline()
            if not line:
                break
            next_sensor = line.split(',', 1)[0]
            if next_sensor != sensor:
                if sensor:
                    sensors.append(sensor)
                    # tell() is already past the new sensor's first line,
                    # so back up by len(line) to get the block boundary
                    next_offset = readfp.tell() - len(line)
                    sensor_offsets[sensor] = [offset, next_offset - offset]
                    sensor = next_sensor
                    offset = next_offset
                else:
                    # setup for first sensor
                    sensor = next_sensor
        if next_sensor:
            sensors.append(next_sensor)
            sensor_offsets[next_sensor] = [offset, readfp.tell() - offset]
        self.readfp = readfp
        self.sensor_offsets = sensor_offsets
        self.sensors = sensors

    def read_sensor(self, sensorname):
        # seek to this sensor's next unread line; the size argument to
        # readline() keeps us from reading past the sensor's block
        pos_data = self.sensor_offsets[sensorname]
        self.readfp.seek(pos_data[0])
        line = self.readfp.readline(pos_data[1])
        pos_data[0] += len(line)
        pos_data[1] -= len(line)
        return line

    @property
    def data_remains(self):
        # true while any sensor still has unread bytes
        return any(pos_data[1] for pos_data in self.sensor_offsets.itervalues())

    def close(self):
        self.readfp.close()

sensor_lines = ReadSensorLines("data.csv")
while sensor_lines.data_remains:
    row = []
    for sensor in sensor_lines.sensors:
        sensor_line = sensor_lines.read_sensor(sensor)
        if sensor_line:
            _, _, date, volts = sensor_line.strip().split(',')
            row.append(volts)
        else:
            row.append('')
    row.insert(0, date)
    # the timestamps look like "6/1/2014 12:00:00 AM"
    row[0] = str(datetime.datetime.strptime(row[0], '%m/%d/%Y %I:%M:%S %p'))
    print ','.join(row)
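And if you want newvolt.csv rather than printed output, the loop above can feed a csv.writer instead. A quick sketch (the output file name comes from the question; this continues the code above, so sensor_lines is assumed to exist):

import csv

# sketch: send the pivoted rows to newvolt.csv instead of stdout
outfp = open("newvolt.csv", "wb")
writer = csv.writer(outfp)
writer.writerow(["timestamp"] + sensor_lines.sensors)  # header row
# then, inside the while loop above, replace the print with:
#     writer.writerow(row)
outfp.close()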
Upvotes: 1
Reputation: 2505
It seems like the information in the file is already sorted in the right order. What you can do is:
Write the header "timestamp,13986513,13986534,..." to a file "timestamp.txt". Then grab the value "13986513,6,6/1/2014 12:00:00" into a global string, and write the line out to a file "volt.csv".
Every time the "13986513,6,6/1/2014 12:00:00" matches the previous one, you can add to it to create "2014-06-01 12:00:00 PDT,248.7,249.3,...".
But each time, read one line and write that one line away. If you keep it all in memory, the program can't handle it anymore.
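A rough sketch of that idea (hedged: it assumes rows arrive sorted by timestamp, so values for the same timestamp sit next to each other; the question's file is sorted by meter first, so the data would need a re-sort by timestamp before this works):

import csv

# sketch: collapse consecutive rows that share a timestamp into one output row
current_ts = None
row = []
with open("voltvalues.csv", "rb") as infile, open("volt.csv", "wb") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    next(reader)  # skip header
    for meter, _, ts, volts in reader:
        if ts != current_ts:
            if row:
                writer.writerow([current_ts] + row)  # finished row
            current_ts, row = ts, []
        row.append(volts)
    if row:
        writer.writerow([current_ts] + row)  # final row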
Have a look at the flush() function. I think you might need that one.
EDIT:
Example code:
class Ice():
    def __init__(self):
        self.Fire()

    def Fire(self):
        with open('File1.txt', 'r') as file1:
            for line in file1:
                # Do something.
                print('Do something....')
                # Save to file.
                with open('File2.txt', 'a') as file2:
                    file2.write(line)
                    file2.flush()
                # the with-blocks close both files automatically

# Run class
Ice()
The thing with big files is that they use a lot of memory. So what you want is to read a line from the file, process it, write it away (out of memory), and take the next line. This way you can process huge files.
What .flush() does is flush the output buffer. In my example you write away a line, but Python doesn't write it to disk at the exact moment of the write(); it stores it in a buffer. By using .flush() the output is written to the file, and the line is not kept in memory.
By creating a temporary file you can process all lines without Python using the maximum of its memory.
Upvotes: 0