Reputation: 574
I'm looking for a way to do this without a UDF, and I'm wondering if it's possible. Let's say I have a DF as follows:
Buyer_name  Buyer_state  CoBuyer_name  CoBuyer_state  Price  Date
Bob         CA           Joe           CA             20     010119
Stacy       IL           Jamie         IL             50     020419
... about 3 million more rows ...
And I want to turn it into:
Buyer_name  Buyer_state  Price  Date
Bob         CA           20     010119
Joe         CA           20     010119
Stacy       IL           50     020419
Jamie       IL           50     020419
...
Edit: I could also:
1. Create two dataframes, removing the "CoBuyer" columns from one and the "Buyer" columns from the other.
2. Rename the "CoBuyer" columns in the second dataframe to the "Buyer" column names.
3. Concatenate both dataframes.
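The steps above can be sketched in Spark Scala (a minimal sketch, assuming an active SparkSession named spark and a dataframe df with the columns shown in the example):

```scala
import spark.implicits._

// Keep the buyer columns in one dataframe ...
val buyers = df.select($"Buyer_name", $"Buyer_state", $"Price", $"Date")

// ... and the co-buyer columns in another, renamed to match.
val coBuyers = df.select(
  $"CoBuyer_name".as("Buyer_name"),
  $"CoBuyer_state".as("Buyer_state"),
  $"Price",
  $"Date"
)

// "Concatenate" the two; union matches columns by position,
// which is why the rename above matters.
val result = buyers.union(coBuyers)
```

Since union matches by position rather than name, the rename step is what keeps the co-buyer values in the right columns.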
Upvotes: 0
Views: 613
Reputation: 14399
This sounds like an unpivot operation to me, which can be accomplished with the union function on DataFrames:
import spark.implicits._

val df = Seq(
  ("Bob", "CA", "Joe", "CA", 20, "010119"),
  ("Stacy", "IL", "Jamie", "IL", 50, "020419")
).toDF("Buyer_name", "Buyer_state", "CoBuyer_name", "CoBuyer_state", "Price", "Date")

val df_new = df.select("Buyer_name", "Buyer_state", "Price", "Date").
  union(df.select("CoBuyer_name", "CoBuyer_state", "Price", "Date"))

df_new.show
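For reference, the same unpivot can also be expressed with the SQL stack generator function through selectExpr. This is a sketch assuming the same df as above:

```scala
// stack(2, ...) emits two rows per input row:
// one from the Buyer columns, one from the CoBuyer columns.
val unpivoted = df.selectExpr(
  "stack(2, Buyer_name, Buyer_state, CoBuyer_name, CoBuyer_state) as (Buyer_name, Buyer_state)",
  "Price",
  "Date"
)
unpivoted.show
```

This scales more gracefully if more column pairs need to be unpivoted later, since only the stack argument list grows.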
Thanks to Leo for providing the dataframe definition, which I've reused.
Upvotes: 2
Reputation: 22449
You can group struct(Buyer_name, Buyer_state) and struct(CoBuyer_name, CoBuyer_state) into an Array, which is then expanded using explode, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
("Bob", "CA", "Joe", "CA", 20, "010119"),
("Stacy", "IL", "Jamie", "IL", 50, "020419")
).toDF("Buyer_name", "Buyer_state", "CoBuyer_name", "CoBuyer_state", "Price", "Date")
df.
withColumn("Buyers", array(
struct($"Buyer_name".as("_1"), $"Buyer_state".as("_2")),
struct($"CoBuyer_name".as("_1"), $"CoBuyer_state".as("_2"))
)).
withColumn("Buyer", explode($"Buyers")).
select(
$"Buyer._1".as("Buyer_name"), $"Buyer._2".as("Buyer_state"), $"Price", $"Date"
).show
// +----------+-----------+-----+------+
// |Buyer_name|Buyer_state|Price| Date|
// +----------+-----------+-----+------+
// | Bob| CA| 20|010119|
// | Joe| CA| 20|010119|
// | Stacy| IL| 50|020419|
// | Jamie| IL| 50|020419|
// +----------+-----------+-----+------+
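A small variation on the above, assuming the same df: naming the struct fields up front avoids the positional _1/_2 names and makes the final select read more naturally.

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

df.
  // Explode an array of two named structs: one per buyer role.
  withColumn("Buyer", explode(array(
    struct($"Buyer_name".as("name"), $"Buyer_state".as("state")),
    struct($"CoBuyer_name".as("name"), $"CoBuyer_state".as("state"))
  ))).
  select(
    $"Buyer.name".as("Buyer_name"),
    $"Buyer.state".as("Buyer_state"),
    $"Price",
    $"Date"
  ).show
```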
Upvotes: 4