Reputation: 143
I have to read through 2 different types of files at the same time in order to synchronise their data. The files are generated in parallel with different frequencies.
File 1, which will be very big (>10 GB), has the following structure: DATA is a field containing 100 characters, and the number that follows it is a synchronisation signal that is common to both files (i.e. it changes at the same time in both files).
DATA 1
DATA 1
... another 4000 lines
DATA 1
DATA 0
... another 4000 lines and so on
File 2, which is small (at most 10 MB, but there are more of them), has the same structure; the difference is in the number of rows between synchronisation signal changes:
DATA 1
... another 300-400 lines
DATA 1
DATA 0
... and so on
Here is the code that I use to read the files:
def getSynchedChunk(fileHandler, lastSynch, end_of_file):
    line_vector = []                        # initialize output array
    for line in fileHandler:                # iterate over the file
        synch = int(line.split(';')[9])     # get synch signal
        line_vector.append(line)
        if synch != lastSynch:              # if a transition is detected
            lastSynch = synch               # update the lastSynch variable for later use
            return (lastSynch, line_vector, True)    # and exit - True = synch changed
    return (lastSynch, line_vector, False)  # exit if end of file is reached
I have to synchronise the data chunks (the lines that have the same synch signal value) and write the new lines to another file. I am using Spyder.
For testing, I used smaller files: 350 MB for FILE 1 and 35 MB for FILE 2. I also used the built-in profiler to see where most of the time is spent, and it turns out that 28 s out of 46 s is spent actually reading the data from the files. The rest is used for synchronising the data and writing to the new file.
If I scale the time up to gigabyte-sized files, it will take hours to finish the processing. I will try to change the way I do the processing to make it faster, but is there a faster way to read through big files?
One line of data looks like this:
01/31/19 08:20:55.886;0.049107050;-0.158385641;9.457415342;-0.025256720;-0.017626805;-0.000096349;0.107;-0.112;0
The values are sensor measurements. The last number is the synch value.
Upvotes: 3
Views: 1098
Reputation: 5774
I recommend reading in the whole files first and then doing the processing. This has the huge advantage that all the appending/concatenating etc. while reading is done internally with optimized modules. The synching can be done afterwards.
For this purpose I strongly recommend using pandas, which is imho by far the best tool to work with timeseries data like measurements.
Importing your files, assuming that semicolon-separated csv in a text file is the correct format, can be done with:
import pandas as pd

df = pd.read_csv(
    'DATA.txt', sep=';', header=None, index_col=0,
    parse_dates=True, infer_datetime_format=True,
    dayfirst=False)  # timestamps in the sample line are MM/DD/YY
To reduce memory consumption, you can either specify a chunksize to split the file reading (see the sketch after the next snippet), or pass low_memory=True to internally split the file reading process (assuming that the final dataframe fits in your memory):
df = pd.read_csv(
    'DATA.txt', sep=';', header=None, index_col=0,
    parse_dates=True, infer_datetime_format=True, dayfirst=False,
    low_memory=True)
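If you go the chunksize route instead, read_csv returns an iterator of DataFrames rather than one big frame. A minimal sketch (the chunk size and the process() helper are only placeholders for illustration, not part of your actual code):
reader = pd.read_csv(
    'DATA.txt', sep=';', header=None, index_col=0,
    parse_dates=True, infer_datetime_format=True, dayfirst=False,
    chunksize=100_000)              # rows per chunk - placeholder value

for chunk in reader:                # each chunk is an ordinary DataFrame
    process(chunk)                  # hypothetical per-chunk work (synching, writing, ...)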
Now your data will be stored in a DataFrame, which is perfect for time series. The index is already converted to a DatetimeIndex, which will allow for nice plotting, resampling, etc.
The synch state can now be easily accessed like in a numpy array (just add the iloc accessor):
df.iloc[:, 8]   # all synch states
df.iloc[0, 8]   # the first synch state
df.iloc[1, 8]   # the second synch state
This is ideal for using fast vectorized synching of two or more files.
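For example, a sketch of that vectorized splitting (assuming the df from the snippets above; block_id is just a name introduced here for illustration): give every row a block number that increases at each synch transition, then group by it:
synch = df.iloc[:, 8]                                   # the synch column
block_id = synch.ne(synch.shift()).cumsum()             # increases by 1 at every transition
chunks = [group for _, group in df.groupby(block_id)]   # same-synch chunks, in file order
Doing the same for the second file gives two chunk lists that can then be aligned pairwise.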
To read the file depending on the available memory:
try:
    df = pd.read_csv(
        'DATA.txt', sep=';', header=None, index_col=0,
        parse_dates=True, infer_datetime_format=True, dayfirst=False)
except MemoryError:
    df = pd.read_csv(
        'DATA.txt', sep=';', header=None, index_col=0,
        parse_dates=True, infer_datetime_format=True, dayfirst=False,
        low_memory=True)
This try/except approach might not be elegant, since it will take some time before the MemoryError is raised, but it is failsafe. And since low_memory=True will most probably reduce the file reading performance, the try block should be faster in most cases.
Upvotes: 1
Reputation: 77
I'm not used to Spyder, but you can try to use multithreading for chunking the big files. Python has an option for this without any external library, so it will probably work with Spyder as well. (https://docs.python.org/3/library/threading.html)
The process of chunking:
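A minimal sketch of that idea (the chunk size, worker count, and process_chunk function are illustrative assumptions, not from your setup), using the standard threading and queue modules:
import threading
import queue

def process_chunk(lines):
    # hypothetical placeholder for the per-chunk work (parsing, synching, writing)
    pass

def worker(q):
    while True:
        chunk = q.get()
        if chunk is None:              # sentinel value: no more chunks
            break
        process_chunk(chunk)

q = queue.Queue(maxsize=4)             # bounded queue keeps memory use small
threads = [threading.Thread(target=worker, args=(q,)) for _ in range(2)]
for t in threads:
    t.start()

with open('FILE1.txt') as f:
    chunk = []
    for line in f:
        chunk.append(line)
        if len(chunk) == 100_000:      # hand off a chunk every 100k lines
            q.put(chunk)
            chunk = []
    if chunk:
        q.put(chunk)

for _ in threads:
    q.put(None)                        # one sentinel per worker thread
for t in threads:
    t.join()
Note that because of the GIL this mainly helps when the per-chunk work is I/O-bound (e.g. writing the merged output); for CPU-bound parsing, multiprocessing is the heavier-weight alternative.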
Upvotes: 1