Eduard Palko Mate

Reputation: 143

Python: Multiple file processing is very slow

I have to read through 2 different types of files at the same time in order to synchronise their data. The files are generated in parallel with different frequencies.

File 1, which will be very big in size (>10 GB), has the following structure: DATA is a field containing 100 characters, and the number that follows it is a synchronisation signal that is common to both files (i.e. it changes at the same time in both files).

DATA 1
DATA 1
... another 4000 lines
DATA 1
DATA 0
... another 4000 lines and so on

File 2, small in size (at most 10 MB, but there are more of them), has the same structure, the difference being the number of rows between synchronisation signal changes:

DATA 1
... another 300-400 lines
DATA 1
DATA 0
... and so on

Here is the code that I use to read the files:

def getSynchedChunk(fileHandler, lastSynch, end_of_file):
    line_vector = []                          # initialize output array
    for line in fileHandler:                  # iterate over the file
        synch = int(line.split(';')[9])       # get synch signal
        line_vector.append(line)
        if synch != lastSynch:                # if a transition is detected
            lastSynch = synch                 # update the lastSynch variable for later use
            return (lastSynch, line_vector, True)   # and exit - True = synch changed

    return (lastSynch, line_vector, False)    # exit if end of file is reached

I have to synchronise the data chunks (the lines that have the same synch signal value) and write the new lines to another file. I am using Spyder.
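For reference, the loop that drives getSynchedChunk looks roughly like this (a simplified sketch; the file names are placeholders and the actual matching/writing logic is omitted):

with open('FILE1.txt') as f1, open('FILE2.txt') as f2:   # placeholder file names
    synch1 = synch2 = 1            # both files start with synch = 1
    more1 = more2 = True
    while more1 and more2:
        synch1, chunk1, more1 = getSynchedChunk(f1, synch1, more1)
        synch2, chunk2, more2 = getSynchedChunk(f2, synch2, more2)
        # ... match chunk1 against chunk2 and write the merged lines here ...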

For testing, I used smaller files, 350 MB for FILE 1 and 35 MB for FILE 2. I also used the built-in Profiler to see where the most time is spent, and it seems that 28s out of 46s is spent in actually reading the data from the files. The rest is used in synchronising the data and writing to the new file.

If I scale the time up to gigabyte-sized files, the processing will take hours to finish. I will try to change the way I do the processing to make it faster, but is there a faster way to read through big files?


One line of data looks like this:

01/31/19 08:20:55.886;0.049107050;-0.158385641;9.457415342;-0.025256720;-0.017626805;-0.000096349;0.107;-0.112;0

The values are sensor measurements. The last number is the synch value.

Upvotes: 3

Views: 1098

Answers (2)

JE_Muc

Reputation: 5774

I recommend reading in the whole files first and then doing the processing. This has the huge advantage that all the appending/concatenating etc. while reading is done internally by optimized modules. The synching can be done afterwards.

For this purpose I strongly recommend using pandas, which is imho by far the best tool to work with timeseries data like measurements.

Importing your files, assuming CSV in a text file is the correct format, can be done with:

import pandas as pd

df = pd.read_csv(
    'DATA.txt', sep=';', header=None, index_col=0,
    parse_dates=True, infer_datetime_format=True, dayfirst=True)

To reduce memory consumption, you can either specify a chunksize to split the file reading (see the sketch further below), or pass low_memory=True to split the file reading process internally (assuming that the final DataFrame fits in your memory):

df = pd.read_csv(
    'DATA.txt', sep=';', header=None, index_col=0, 
    parse_dates=True, infer_datetime_format=True, dayfirst=True,
    low_memory=True)
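
The chunksize variant instead returns an iterator of DataFrames rather than one big frame (a sketch; the chunk size is an arbitrary placeholder):

reader = pd.read_csv(
    'DATA.txt', sep=';', header=None, index_col=0,
    parse_dates=True, infer_datetime_format=True, dayfirst=True,
    chunksize=1_000_000)   # yields one DataFrame per million rows

for chunk in reader:
    ...                    # placeholder: per-chunk synching/writing goes here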

After a whole-file read, your data will be stored in a DataFrame, which is perfect for time series. The index is already converted to a DatetimeIndex, which allows for nice plotting, resampling etc.

The sync state can now be accessed easily, just like in a numpy array, using the iloc accessor:

df.iloc[:, 8]  # for all sync states
df.iloc[0, 8]  # for the first synch state
df.iloc[1, 8]  # for the second synch state

This is ideal for fast, vectorized synching of two or more files.
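
For example, the row positions where the synch flag flips can be found without a Python-level loop (a sketch on top of the DataFrame read above):

import numpy as np

sync = df.iloc[:, 8].to_numpy()                  # the synch column as a plain array
starts = np.flatnonzero(np.diff(sync) != 0) + 1  # rows where a new synch chunk begins
chunks = np.split(np.arange(len(sync)), starts)  # row positions grouped per synch chunk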


To read the file depending on the available memory:

try:
    df = pd.read_csv(
        'DATA.txt', sep=';', header=None, index_col=0, 
        parse_dates=True, infer_datetime_format=True, dayfirst=True)
except MemoryError:
    df = pd.read_csv(
        'DATA.txt', sep=';', header=None, index_col=0, 
        parse_dates=True, infer_datetime_format=True, dayfirst=True,
        low_memory=True)

This try/except approach might not be elegant, since it will take some time before the MemoryError is raised, but it is failsafe. And since low_memory=True will most probably reduce file reading performance, the try block should be faster in most cases.

Upvotes: 1

I'm not used to Spyder, but you can try using multithreading for chunking the big files. Python has an option for this without any external library, so it will probably work with Spyder as well (https://docs.python.org/3/library/threading.html).

The process of chunking (a rough sketch follows the list):

  1. Get the length of the file in lines
  2. Start cutting the list into halves until it's "not too big"
  3. Use a thread for each small chunk.
  4. Profit
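
A rough sketch of this idea using only the standard threading module (chunk size, file name and the per-chunk work are placeholders):

import threading

CHUNK_LINES = 100_000                            # "not too big" -- tune as needed

def process_chunk(lines):
    ...                                          # placeholder for the actual per-chunk work

with open('FILE1.txt') as f:                     # placeholder file name
    all_lines = f.readlines()                    # 1. get the file as a list of lines

chunks = [all_lines[i:i + CHUNK_LINES]           # 2. cut the list into small pieces
          for i in range(0, len(all_lines), CHUNK_LINES)]

threads = [threading.Thread(target=process_chunk, args=(chunk,))
           for chunk in chunks]                  # 3. one thread per chunk
for t in threads:
    t.start()
for t in threads:
    t.join()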

Upvotes: 1
