Reputation: 721
Reading the Parquet files:

import dask.dataframe as dd

df_ss_parq = dd.read_parquet("trainSearchStream.parquet/")
df_ai_parq = dd.read_parquet("AdsInfo.parquet/")
Merging the two datasets:

df_train = df_ss_parq.merge(df_ai_parq, on="ad_id", how="left").compute()
RAM: 16 GB
I have tried setting an index on the "ad_id" column, which makes the merge faster but still fails with the same error.
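For context, a minimal sketch of the indexed variant mentioned above (the shortened names df_ss and df_ai are just illustrative); it still materialises the full join with .compute(), so memory usage is unchanged:

import dask.dataframe as dd

# Setting the index sorts and partitions both frames by the join key,
# which speeds up the merge itself.
df_ss = dd.read_parquet("trainSearchStream.parquet/").set_index("ad_id")
df_ai = dd.read_parquet("AdsInfo.parquet/").set_index("ad_id")

# The join is aligned on the index, but .compute() still pulls the whole
# result into RAM.
df_train = df_ss.merge(df_ai, left_index=True, right_index=True, how="left").compute()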
trainSearchStream size = 17 GB
AdsInfo size = 17 GB
Does anybody have any idea how to solve this?
Upvotes: 1
Views: 384
Reputation: 105471
I would suggest using a SQL engine like Impala or Drill to do the join, writing the result out to new Parquet files. The Python data stack is not very well-suited right now for handling joins between large tables in a memory-constrained environment.
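To illustrate the read-back step, here is a minimal sketch assuming the SQL engine has already written the joined result to a hypothetical trainMerged.parquet/ directory:

import dask.dataframe as dd

# After Impala/Drill has done the heavy join and written new Parquet files,
# Python only needs to read them back lazily; no large in-memory join is needed.
df_train = dd.read_parquet("trainMerged.parquet/")

# Inspect the first rows without materialising the full dataset.
print(df_train.head())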
Upvotes: 1