Reputation: 35
I need to compare two CSV files and do an inner join. I am using vaex, which is faster than pandas, but I got stuck. My code worked with pandas, but it was slow. How can I inner join two HDF5 files and write the output to CSV?
My code:
vaex_df1 = vaex.from_csv(file1,convert=True, chunk_size=5_000)
vaex_df2 = vaex.from_csv(file2,convert=True, chunk_size=5_000)
vaex_df1 = vaex.open(file1+'.hdf5')
vaex_df2 = vaex.open(file2+'.hdf5')
print(type(vaex_df1),vaex_df1)
print(type(vaex_df2),vaex_df2)
df_join = pd.merge(vaex_df1,vaex_df2,how='inner',left_on ='CL_CLIENT_ID',right_on='CL_CLIENT_ID')
df_join.to_csv('C:\\Users\\abc\\Desktop\\New folder\\file3.csv')
print("success in compare")
pandas has merge; is there an equivalent way to do an inner join in vaex? I couldn't find much online. The code fails at the 'df_join = pd.merge' line, which is expected, since pd.merge is being given vaex DataFrames rather than pandas ones.
Upvotes: 1
Views: 2183
Reputation: 11105
The vaex tutorial has a section on joining: https://vaex.io/docs/tutorial.html#Joining. The API looks similar to that of pandas, except that the join is a method on the DataFrame rather than a module-level function like pd.merge. Try:
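# Both frames have a CL_CLIENT_ID column, so rsuffix is given to avoid
# a column name collision on the right-hand side.
df_join = vaex_df1.join(vaex_df2,
                        how='inner',
                        left_on='CL_CLIENT_ID',
                        right_on='CL_CLIENT_ID',
                        rsuffix='_right')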
Upvotes: 1