CodingInCircles

Reputation: 2807

Efficiently write millions of lines to a file using Python dataframe

I have the following code snippet that reads a CSV into a dataframe and writes out key-value pairs to a file in a Redis protocol-compliant fashion, i.e. SET key1 value1. The code is piecemeal, and I have tried using multiprocessing, though I am not sure it gains anything.

The CSV has about 6 million lines, which are read into a dataframe fairly quickly (under 2 minutes). The output file has 12 million lines (2 lines per line of the input file), and writing it takes about 50 minutes to complete. Can any part of my code be optimized/changed to make this run faster? Once the file is complete, loading it into Redis takes less than 90 seconds. The bottleneck really is in writing to the file. I will have several such files to write, and spending 50-60 minutes per file is really not ideal. This particular dataset has 6 million rows and 10 columns, mostly strings with a few float columns. The strings become the Redis keys and the floats become the Redis values in each key-value pair. Other datasets will be similarly sized, if not bigger (in both rows and columns).

I was looking into loading all the strings I generate into a dataframe and then using to_csv() to dump it to a file, but I'm not sure how that would perform.
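Something like this is what I had in mind (just a rough sketch; the column names are placeholders for my real ones, and it only shows one SET line per row):

# Rough sketch of the to_csv() idea; column names are placeholders.
import pandas as pd

df = pd.read_csv('/path/to/file.csv')
key = df['col1'] + df['col4'] + df['col5']              # concatenate the string columns
out = 'SET ' + key + ' ' + df['col_n+1'].astype(str)    # one SET command per row
out.to_csv('output_file', index=False, header=False)

My actual code so far is below.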

import pandas as pd
from multiprocessing import Process

filepath = '/path/to/file.csv'

def df_to_file():
    df = pd.read_csv(filepath)
    f = open('output_file', 'w')
    for i in range(len(df.index)):
        if df['col1'].iloc[i] != '':
            key1 = str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
            val1 = df['col_n+1'].iloc[i]

            key1a = str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
            val1a = df['col_n+2'].iloc[i]

            print('SET {0} {1}\nSET {0} {1}'.format(key1, val1, key1a, val1a), file = f)

        if df['col2'].iloc[i] != '':
            key1 = str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
            val1 = df['col_n+1'].iloc[i]

            key1a = str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
            val1a = df['col_n+2'].iloc[i]

            print('SET {0} {1}\nSET {0} {1}'.format(key1, val1, key1a, val1a), file = f)
        if df['col3'].iloc[i] != '':
            key1 = str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
            val1 = df['col_n+1'].iloc[i]

            key1a = str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
            val1a = df['col_n+2'].iloc[i]

            print('SET {0} {1}\nSET {0} {1}'.format(key1, val1, key1a, val1a), file = f)
    f.close()

p = Process(target = df_to_file)
p.start()
p.join() 

Upvotes: 4

Views: 2359

Answers (1)

juanpa.arrivillaga

Reputation: 96127

Using a construction like df['col1'].iloc[i] to loop over individual rows is going to be slow; the iloc and loc based selectors are meant for selecting across entire DataFrames, and they do index-alignment work that carries a lot of overhead when repeated for every row. Simply iterating over the rows with df.itertuples() will be significantly faster.

def df_to_file():
    df = pd.read_csv(filepath)
    f = open('output_file', 'w') # binary mode ('wb') could be faster, but print() needs a text-mode file; you'd have to encode and use f.write() instead
    for row in df.itertuples():
        if row.col1:
            key1, val1 = string1, string2      # placeholders: build your key/value strings here
            key1a, val1a = string1a, string2a
            print('SET {0} {1}\nSET {0} {1}'.format(key1, val1, key1a, val1a), file = f)
        if row.col2:
            key1, val1 = string1, string2
            key1a, val1a = string1a, string2a
            print('SET {0} {1}\nSET {0} {1}'.format(key1, val1, key1a, val1a), file = f)
        if row.col3:
            key1, val1 = string1, string2
            key1a, val1a = string1a, string2a
            print('SET {0} {1}\nSET {0} {1}'.format(key1, val1, key1a, val1a), file = f)
    f.close()

This is perhaps the bare-minimum optimization you could make. If you described in more detail exactly what you are doing, perhaps a vectorized solution could be found.
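For example, if the keys really are just the string columns concatenated together and the values are single float columns, something along these lines might work. This is only a sketch with placeholder column names, showing the col1 branch (repeat for col2/col3):

import pandas as pd

filepath = '/path/to/file.csv'
df = pd.read_csv(filepath)

mask = df['col1'] != ''                     # same emptiness check as your loop
key = df.loc[mask, 'col1'] + df.loc[mask, 'col4'] + df.loc[mask, 'col5']

# Build each output line as a whole Series instead of row-by-row
line1 = 'SET ' + key + ' ' + df.loc[mask, 'col_n+1'].astype(str)
line2 = 'SET ' + key + ' ' + df.loc[mask, 'col_n+2'].astype(str)

with open('output_file', 'w') as f:
    f.write('\n'.join(line1))
    f.write('\n')
    f.write('\n'.join(line2))
    f.write('\n')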

Also, don't use the above with multiprocessing.

Also, as written, 'SET {0} {1}\nSET {0} {1}'.format(key1, val1, key1a, val1a) only ever uses key1 and val1 (the format string never references {2} or {3}, so key1a and val1a are ignored), and both SET lines in each print are identical. If those parameters really aren't changing, simply do the string concatenation once outside the loop and re-use the whole string; if they are supposed to differ, the format string needs to be 'SET {0} {1}\nSET {2} {3}'.

Edit: It seems you can't do that. However, given:

This particular dataset has 6 million rows and 10 columns, mostly comprised of strings with a few float columns. The Redis keys are the strings and the float values are the Redis values in the key-value pair.

Then simply do key1 = ''.join((row.col1, row.col4, row.col5, ...)). Don't use str and the + operator; that is horribly inefficient, doubly so since you imply those columns are already strings. If you must call str on all those columns, use ''.join(map(str, (...))).
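Inside the itertuples() loop that would look roughly like this (placeholder column names again, and df as read above):

for row in df.itertuples():
    if row.col1:
        key1 = ''.join((row.col1, row.col4, row.col5))             # columns already strings
        # or, if some of them are not strings:
        key1 = ''.join(map(str, (row.col1, row.col4, row.col5)))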

Finally, if you really need to squeeze performance out, note that the row objects will be namedtuples, which are tuples, so you can use integer-based indexing instead of attribute-based access, i.e. row[1] instead of row.col1 (note that row[0] will be row.Index, i.e. the frame's index). That should be faster, and it will make a difference since you are indexing into the tuple dozens of times per iteration over millions of iterations.
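For example, assuming col1, col4 and col5 are the 1st, 4th and 5th columns of the frame:

for row in df.itertuples():
    # row[0] is row.Index; the data columns start at row[1]
    if row[1]:                                    # instead of row.col1
        key1 = ''.join((row[1], row[4], row[5]))  # instead of row.col1, row.col4, row.col5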

Upvotes: 2
