ASH

Reputation: 20362

Trying to Merge or Concat two pyspark.sql.dataframe.DataFrame in Databricks Environment

I have two dataframes in Azure Databricks. Both are of type: pyspark.sql.dataframe.DataFrame

The number of rows is the same in both, and the indexes are the same. I thought one of the code snippets below would do the job.

First Attempt:

result = pd.concat([df1, df2], axis=1)


Error Message: TypeError: cannot concatenate object of type "<class 'pyspark.sql.dataframe.DataFrame'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid

Second Attempt:

result = pd.merge(df1, df2, left_index=True, right_index=True)

Error Message: TypeError: Can only merge Series or DataFrame objects, a <class 'pyspark.sql.dataframe.DataFrame'> was passed

Upvotes: 0

Views: 5140

Answers (2)

Nija I Pillai

Reputation: 1136

I faced a similar issue when combining two dataframes that have the same columns.

df = pd.concat([df, resultant_df], ignore_index=True)
TypeError: cannot concatenate object of type '<class 'pyspark.sql.dataframe.DataFrame'>'; only Series and DataFrame objs are valid

Then I tried join(), but it appended the columns multiple times and returned an empty dataframe.

df.join(resultant_df)

After that I used union(), which gave exactly the result I wanted.

df = df.union(resultant_df)
df.show()

It works fine in my case.

Upvotes: 2

ASH

Reputation: 20362

I ended up converting the two objects to pandas dataframes and then merging them with the technique I already knew.

Step #1:

# Collect each Spark DataFrame to the driver as a pandas DataFrame
# (select("*") is not needed before toPandas()).
df1 = df1.toPandas()
df2 = df2.toPandas()

Step #2:

import pandas as pd
result = pd.concat([df1, df2], axis=1)

Done!
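
Two caveats with this route: toPandas() collects every row onto the driver, so it only suits data that fits in memory there, and pd.concat(..., axis=1) aligns rows by index label, not by position. A dataframe fresh from toPandas() normally has a default 0..n-1 index, so the rows line up, but if either index has changed in the meantime (for example after filtering in pandas), the result is padded with NaN. A small sketch with made-up data, using stand-in names pdf1 and pdf2 for the converted dataframes:

```python
import pandas as pd

# Stand-ins for two converted dataframes whose index labels do not match.
pdf1 = pd.DataFrame({"id": [1, 2, 3]}, index=[0, 1, 2])
pdf2 = pd.DataFrame({"score": [10.0, 20.0, 30.0]}, index=[5, 6, 7])

# Index labels differ, so concat produces the union of labels and fills gaps with NaN.
misaligned = pd.concat([pdf1, pdf2], axis=1)

# Resetting both indexes makes the concat pair rows purely by position.
aligned = pd.concat(
    [pdf1.reset_index(drop=True), pdf2.reset_index(drop=True)],
    axis=1,
)

print(misaligned.shape)
print(aligned)
```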

Upvotes: 3
