Reputation: 853
I'm trying to merge a couple of dataframes from the HomeCredit Kaggle competition according to the data schema. I did the following:
import pandas as pd

train = pd.read_csv('~/Documents/HomeCredit/application_train.csv')
bureau = pd.read_csv('~/Documents/HomeCredit/bureau.csv')
bureau_balance = pd.read_csv('~/Documents/HomeCredit/bureau_balance.csv')

# attach bureau records to applications, then monthly balances to bureau records
train = train.merge(bureau, how='outer', on='SK_ID_CURR')
train = train.merge(bureau_balance, how='inner', on='SK_ID_BUREAU')
which fails with a
MemoryError
on the second merge. The train dataframe has shape (308k, 122), bureau (1.72M, 12), and bureau_balance (27.3M, 3). My understanding is that an application in the train df does not have to have a record in the bureau table, but every row of that table should have records in bureau_balance.
I'm running the code on my local machine with 16GB of RAM.
Is there a way to work around the memory issue with such a large dataset?
Thanks in advance.
Upvotes: 0
Views: 214
Reputation: 1034
After a certain problem size, pandas is no longer the appropriate tool. I would import the data into a relational database and issue SQL queries. SQLAlchemy is a nice Python tool for working with databases.
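A minimal sketch of that approach, assuming SQLite as the backend. The database file homecredit.db, the chunk size of 100_000, and the process() helper are all illustrative, not from the question; and since older SQLite versions lack FULL OUTER JOIN, a LEFT JOIN stands in for your outer merge:

import pandas as pd
from sqlalchemy import create_engine

# a local on-disk SQLite database; any relational backend works the same way
engine = create_engine('sqlite:///homecredit.db')

# stream each CSV into its own table in chunks so no full file sits in RAM
for name in ['application_train', 'bureau', 'bureau_balance']:
    for chunk in pd.read_csv(f'~/Documents/HomeCredit/{name}.csv',
                             chunksize=100_000):
        chunk.to_sql(name, engine, if_exists='append', index=False)

# let the database perform the join on disk, then iterate over the result
query = """
    SELECT *
    FROM application_train AS t
    LEFT JOIN bureau USING (SK_ID_CURR)
    INNER JOIN bureau_balance USING (SK_ID_BUREAU)
"""
for chunk in pd.read_sql_query(query, engine, chunksize=100_000):
    process(chunk)  # hypothetical: aggregate or write out each piece instead of holding all rows

This way at most one chunk of each CSV and one chunk of the join result is in memory at any time, so the 27M-row join never has to fit in your 16GB of RAM.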
Upvotes: 1