Reputation: 181
Is there an equivalent in PySpark that allows me to do the same operation as in Pandas:
pd.concat([df1, df2], axis=1)
I have tried several methods so far and none of them seems to work. The concatenation they do is vertical, and I need to concatenate multiple Spark DataFrames side by side into one whole DataFrame.
If I use union or unionAll, the DataFrames just get stacked vertically on the same set of columns, which is not useful for my use case. I also tried this example (did not work either):
from functools import reduce
from pyspark.sql import DataFrame

def unionAll(*dfs):
    # This still appends rows (vertical stacking), not columns
    return reduce(DataFrame.unionAll, dfs)
Any help will be greatly appreciated.
Upvotes: 3
Views: 2380
Reputation: 1304
The best way I have found is to join the DataFrames on a unique id, and org.apache.spark.sql.functions.monotonically_increasing_id() happens to do the job.
The following code is in Scala (it would be essentially the same in PySpark):
Seq(df1, df2, df3)
  .map(_.withColumn("id", monotonically_increasing_id()))
  .reduce((a, b) => a.join(b, "id"))
This gives the horizontally concatenated DataFrame.
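For reference, here is a minimal sketch of the PySpark translation, assuming df1, df2, and df3 are existing DataFrames:

from functools import reduce
from pyspark.sql.functions import monotonically_increasing_id

# Tag each DataFrame with a generated row id, then join them all on it.
# Note: monotonically_increasing_id() values depend on partitioning, so the
# ids only line up across DataFrames when the inputs are partitioned the same way.
dfs = [df.withColumn("id", monotonically_increasing_id()) for df in (df1, df2, df3)]
result = reduce(lambda a, b: a.join(b, "id"), dfs).drop("id")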
Upvotes: 1