Reputation: 853
I'm trying to merge a couple of dataframes from the HomeCredit Kaggle competition according to the data schema. I did the following:
import pandas as pd

train = pd.read_csv('~/Documents/HomeCredit/application_train.csv')
bureau = pd.read_csv('~/Documents/HomeCredit/bureau.csv')
bureau_balance = pd.read_csv('~/Documents/HomeCredit/bureau_balance.csv')

# attach bureau records to applications, then monthly balances to bureau records
train = train.merge(bureau, how='outer', on='SK_ID_CURR')
train = train.merge(bureau_balance, how='inner', on='SK_ID_BUREAU')
which fails with a
MemoryError
on the second merge. The train dataframe has shape (308k, 122), bureau (1.72M, 12), and bureau_balance (27.3M, 3). My understanding is that an application in the train df does not have to have a record in the bureau table, but every row of that table should have records in bureau_balance.
I'm running the code on my local machine with 16GB of RAM.
Is there a way to work around the memory issue with such a large dataset?
Thanks in advance.
Upvotes: 0
Views: 214
Reputation: 1034
After a certain problem size, pandas is no longer the appropriate tool. I would import the data into a relational database and issue SQL queries. SQLAlchemy is a nice Python tool for working with databases.
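A minimal sketch of that approach, assuming SQLite as the backend. The database file homecredit.db, the chunk size of 100_000, and the process() helper are all illustrative, not from the question; and since older SQLite versions lack FULL OUTER JOIN, a LEFT JOIN stands in for your outer merge:

import pandas as pd
from sqlalchemy import create_engine

# a local on-disk SQLite database; any relational backend works the same way
engine = create_engine('sqlite:///homecredit.db')

# stream each CSV into its own table in chunks so no full file sits in RAM
for name in ['application_train', 'bureau', 'bureau_balance']:
    for chunk in pd.read_csv(f'~/Documents/HomeCredit/{name}.csv',
                             chunksize=100_000):
        chunk.to_sql(name, engine, if_exists='append', index=False)

# let the database perform the join on disk, then iterate over the result
query = """
    SELECT *
    FROM application_train AS t
    LEFT JOIN bureau USING (SK_ID_CURR)
    INNER JOIN bureau_balance USING (SK_ID_BUREAU)
"""
for chunk in pd.read_sql_query(query, engine, chunksize=100_000):
    process(chunk)  # hypothetical: aggregate or write out each piece instead of holding all rows

This way at most one chunk of each CSV and one chunk of the join result is in memory at any time, so the 27M-row join never has to fit in your 16GB of RAM.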
Upvotes: 1