Parzival

Reputation: 2064

Outer join in Python for thousands of large tables

So, I have some 4,000 CSV files and I need to outer join all of them. Each file has two columns (a string and a float) and between 10,000 and 1,000,000 rows, and I want to join by the first column (i.e., the string variable).

I tried numpy.lib.recfunctions.join_by, but that was painfully slow. I switched to pandas.merge and that was a lot faster, but still too slow given the number (and size) of tables that I have. And it seems really memory-intensive - to the point where the machine becomes unusable when the file being merged has hundreds of thousands of rows (I'm mostly using a MacBook Pro, 2.4 GHz, 4 GB of RAM).

So now I'm looking for alternatives - are there other potential solutions that I'm missing? What other outer join implementations exist for Python? Is there a paper/site somewhere that discusses and compares the time complexity of each implementation? Would it be more efficient if I simply had Python call, say, sqlite3, and then have sqlite3 do the join? Is the string key the issue? If I could use a numerical key instead, would it be any faster?
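For concreteness, the sqlite3 route I have in mind would be something along these lines (just a sketch for two tables, with made-up table/column names; as far as I can tell sqlite3 has no FULL OUTER JOIN, so it would have to be emulated with LEFT JOIN plus UNION):

import sqlite3

conn = sqlite3.connect('/Users/username/data/merged.db')
cur = conn.cursor()
cur.execute('CREATE TABLE t1 (mykey TEXT PRIMARY KEY, val1 REAL)')
cur.execute('CREATE TABLE t2 (mykey TEXT PRIMARY KEY, val2 REAL)')
# ... bulk-load each CSV with cur.executemany('INSERT INTO t1 VALUES (?, ?)', rows) ...

# emulate a full outer join of t1 and t2 on the string key
rows = cur.execute('''
    SELECT t1.mykey, t1.val1, t2.val2
    FROM t1 LEFT JOIN t2 USING (mykey)
    UNION ALL
    SELECT t2.mykey, NULL, t2.val2
    FROM t2 LEFT JOIN t1 USING (mykey)
    WHERE t1.mykey IS NULL
''').fetchall()
conn.close()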

In case it helps give you a more concrete idea of what I'm trying to achieve, here is my code using pandas.merge:

import os
import pandas as pd

def load_and_merge(file_names, path_to_files, columns):
    '''
    seq, str, dict -> pandas.DataFrame
    '''
    output = pd.DataFrame(columns = ['mykey']) # initialize output DataFrame
    for file in file_names:

        # load new data
        new_data = pd.read_csv(path_to_files + file,
                               usecols = list(columns.keys()),
                               dtype = columns,
                               names = ['mykey', file.replace('.csv', '')],
                               header = None)

        # merge with previous data
        output = pd.merge(output, new_data, on = 'mykey', how = 'outer')
        output = output.fillna(0) # kill NaNs

    return output

path = '/Users/username/data/'
files_list = [file for file in os.listdir(path) if file.endswith('.csv')]
merged_table = load_and_merge(files_list, path, {0: 'S30', 1: 'float'})

(Mac OS X 10.6.8 & Python 2.7.5; Ubuntu 12.04 & Python 2.7.3)

Upvotes: 7

Views: 2423

Answers (2)

Parzival

Reputation: 2064

OK, here is a partial implementation of Jeff's approach (see his answer below), if I understood it correctly. I'm posting this in case anyone else is trying to do something similar, and also in case anyone can help improve or "prettify" this code (right now it's one long, ugly stream of code... I guess I should probably modularize it somehow).

It's a partial implementation because I didn't parallelize the merges. I tried using Python's multiprocessing module, but apparently I don't have enough memory for that - two simultaneous processes were enough to freeze everything (or maybe I just did something utterly idiotic - very possible, as I had never used multiprocessing before). But the rest is here: hierarchical merges and HDF5 (to store the intermediate files).

#!/usr/bin/env python

import os
import sys
import pandas as pd

def merge_csv(folder, cols_info):
    '''
    str, dict -> pandas.DataFrame

    This function outer joins all CSV files in the specified folder. It 
    joins them hierarchically and stores the intermediate files in an HDF5 
    container. The first parameter is the path to the folder and the second 
    parameter is a dictionary mapping each column to the corresponding data 
    type. You can only specify two columns.

    Example: 

    merge_csv('/Users/username/data/', {0: 'S30', 2: 'float'})

    Dependencies:

    - HDF5
    - PyTables
    - pandas
    '''

    # ensure that user is specifying only two columns
    if len(cols_info) != 2:
        sys.exit('Error: You can only specify two columns.')

    # ensure that path to input folder ends with '/'
    folder = folder + '/' if folder[-1] != '/' else folder

    # create HDF5 file to store intermediate files
    store = pd.HDFStore(folder + 'store.h5', mode = 'w')

    # load CSV files and write data to HDF5 file
    flist = [file for file in os.listdir(folder) if file[-4:] == '.csv']
    if len(flist) == 0:
        sys.exit('There are no CSV files in the specified folder.')
    for file in flist:
        case = file.replace('.csv', '')
        store[case] = pd.read_csv(folder + file, 
                                  usecols = [col for col in cols_info], 
                                  names = ['key', case], 
                                  dtype = cols_info)
        store.flush()

    # start merge routine
    flist = store.keys()
    counter = 0
    while len(flist) > 1:
        counter += 1

        # merge current set of files, two by two
        length = (len(flist) - 1) if (len(flist) % 2 == 1) else len(flist)
        for i in range(0, length, 2):
            merged_data = pd.merge(store[flist[i]], store[flist[i + 1]], 
                                   on = 'key', 
                                   how = 'outer',
                                   sort = False)
            outputfile = 'file' + str(i) + str(i + 1)

            # if number of files is odd, make last pair a trio
            if (i == len(flist) - 3) and (len(flist) % 2 == 1):
                merged_data = pd.merge(merged_data, store[flist[i + 2]], 
                                       on = 'key',
                                       how = 'outer', 
                                       sort = False)
                outputfile += str(i + 2)

            # save merged pair (or trio) to HDF5 file
            merged_data = merged_data.fillna(0)
            store.put('/tmp' + str(counter) + '/' + outputfile, merged_data)
            store.flush()

        # clean up
        to_remove = [file for file in store.keys() 
                     if 'tmp' + str(counter) + '/' not in file]
        for file in to_remove:
            store.remove(file)

        # move on to next set of intermediate files
        flist = store.keys()

    # wrap up
    store.close()
    return merged_data

EDIT:

Still no good: the intermediate matrices eventually get too big, exceed the computer's memory, and the code crashes (tables.exceptions.HDF5ExtError: Problems creating the Array.). I've tried sqlite3, but that didn't work either, so I guess I'll just have to keep looking.

Upvotes: 0

Jeff

Reputation: 129018

Here is how I would approach this problem.

Don't merge iteratively. You are merging a smallish frame (call this the 'mergee') with a larger frame (call this the 'merger'), then repeating this, which causes the 'merger' to keep getting bigger and accumulating more rows.

Instead you can do repeated hierarchical merges. Say you number the mergees 1-4000.

merge 1 and 2 to form 1_2, merge 3 and 4 to form 3_4, and so on

Then repeat at the next level: merge 1_2 and 3_4 to form 1_2_3_4, and so on, until a single frame remains

This doesn't change the total amount of work you are doing, but it makes each individual merge much simpler, which lowers the memory barrier and the time spent (since each merge has to go through the cartesian product of the keys). It may make sense to randomize the merge order.
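A bare-bones sequential version of the idea might look something like this (just a sketch; it assumes the frames have already been loaded into a list and share the key column 'mykey'):

import pandas as pd

def hierarchical_outer_merge(frames, key='mykey'):
    # frames: list of DataFrames, each with the key column plus one value column
    while len(frames) > 1:
        next_round = []
        # merge the current frames two by two
        for i in range(0, len(frames) - 1, 2):
            next_round.append(pd.merge(frames[i], frames[i + 1],
                                       on=key, how='outer'))
        # an odd frame at the end is simply carried over to the next round
        if len(frames) % 2 == 1:
            next_round.append(frames[-1])
        frames = next_round
    return frames[0]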

In addition, this is completely parallelizable, in that you can have independent processes work on this problem, generating the intermediate merges. This essentially becomes a map-reduce type of problem.

You can also store the intermediate merges in HDF5 files (using HDFStore), which will make the storage quite efficient. Note that these should be SEPARATE files, to avoid writing to the same file from multiple processes (which is not supported by HDF5).
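For instance, one round of parallel pairwise merges, with each worker writing its result to its own HDF5 file, could be sketched roughly as follows (illustrative only; the 'data' key, the output naming, and merge_pair itself are made up for the example):

import pandas as pd
from multiprocessing import Pool

def merge_pair(args):
    # each worker reads two stores and writes ONE separate HDF5 file
    left_path, right_path, out_path = args
    left = pd.HDFStore(left_path, mode='r')
    right = pd.HDFStore(right_path, mode='r')
    merged = pd.merge(left['data'], right['data'], on='mykey', how='outer')
    left.close()
    right.close()
    out = pd.HDFStore(out_path, mode='w')
    out['data'] = merged
    out.close()
    return out_path

def merge_round(paths, n_workers=2):
    # pair the files up; an odd file at the end is carried over unchanged
    jobs = [(paths[i], paths[i + 1], paths[i].replace('.h5', '_m.h5'))
            for i in range(0, len(paths) - 1, 2)]
    pool = Pool(n_workers)
    merged_paths = pool.map(merge_pair, jobs)
    pool.close()
    pool.join()
    if len(paths) % 2 == 1:
        merged_paths.append(paths[-1])
    return merged_paths

Calling merge_round repeatedly on its own output would reproduce the hierarchical scheme above, one level per call.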

Upvotes: 7
