Wendy Velasquez

Reputation: 181

Horizontal concatenation in PySpark

Is there an equivalent in PySpark that allows me to do an operation similar to the following in pandas?

pd.concat([df1, df2], axis=1)

I have tried several methods, but so far none of them seems to work: the concatenation they produce is vertical, while I need to concatenate multiple Spark DataFrames horizontally into one whole DataFrame.

If I use union or unionAll, the DataFrames get stacked vertically, as a single set of columns, which is not useful for my use case. I have also tried this example (it did not work either):

from functools import reduce
from pyspark.sql import DataFrame

def unionAll(*dfs):
    # Stacks the DataFrames vertically (row-wise), not horizontally.
    return reduce(DataFrame.unionAll, dfs)

Any help will be greatly appreciated.

Upvotes: 3

Views: 2380

Answers (1)

Brown nightingale

Reputation: 1304

The best way I have found is to join the DataFrames on a unique id, and org.apache.spark.sql.functions.monotonically_increasing_id() happens to do the job.

The following code is in Scala (the approach is the same in PySpark; see the sketch below):

import org.apache.spark.sql.functions.monotonically_increasing_id

// Use Seq rather than Set: Seq preserves order and never deduplicates.
Seq(df1, df2, df3).map(_.withColumn("id", monotonically_increasing_id()))
                  .reduce((a, b) => a.join(b, "id"))

This gives the horizontally concatenated DataFrame.
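For completeness, here is a minimal PySpark sketch of the same idea; the DataFrame names and sample data are made up for illustration:

from functools import reduce

from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()

# Toy DataFrames standing in for the ones to be concatenated side by side.
df1 = spark.createDataFrame([("a",), ("b",), ("c",)], ["letters"])
df2 = spark.createDataFrame([(1,), (2,), (3,)], ["numbers"])

# Tag each DataFrame with a generated id, then join them all on that id.
dfs = [df.withColumn("id", monotonically_increasing_id()) for df in (df1, df2)]
result = reduce(lambda a, b: a.join(b, "id"), dfs).drop("id")

result.show()

One caveat: monotonically_increasing_id() derives its ids from the partition layout, so the ids only match across DataFrames whose rows are distributed identically over partitions (true for the toy DataFrames above, which are parallelized the same way). For DataFrames whose partitioning may differ, a row_number() over a window is a more reliable way to build the join key.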

Upvotes: 1
