Reputation: 3930
I have this sample data frame:
val A = """[[15,["Printing Calculators"]],[13811,["Office Products"]]]"""
val B = """[[30888,["Paper & Printable Media"]],[223845,["Office Products"]]]"""
val C = """[[64,["Office Calculator Accessories"]]]"""
val df = List(A,B,C).toDF("bestseller_ranks")
And I would like to create a column which will look like this:
case class BestSellerRank(
Ranking: Integer,
Category: String
)
val A2 = List(new BestSellerRank(15,"Printing Calculators"),new BestSellerRank(13811,"Office Products"))
val B2 = List(new BestSellerRank(30888,"Paper & Printable Media"),new BestSellerRank(223845,"Office Products"))
val C2 =List(new BestSellerRank(64,"Office Calculator Accessories"))
val df2 = List(A2,B2,C2).toDF("bestseller_ranks_transformed")
I have tried to create UDF like this:
val BRUDF: UserDefinedFunction =
udf(
(bestseller_ranks: String) => {
bestseller_ranks.split(",").fold(List.empty[BestSellerRank])(v => new BestSellerRank(v._1, v._2))
}
)
But this seems to be total junk and I am stuck. Thanks for help!
Upvotes: 0
Views: 36
Reputation: 6323
I've tried to implement this without UDF. Perhaps this is helpful
val A = """[[15,["Printing Calculators"]],[13811,["Office Products"]]]"""
val B = """[[30888,["Paper & Printable Media"]],[223845,["Office Products"]]]"""
val C = """[[64,["Office Calculator Accessories"]]]"""
val df = List(A,B,C).toDF("bestseller_ranks")
df.show(false)
df.printSchema()
/**
* +------------------------------------------------------------------+
* |bestseller_ranks |
* +------------------------------------------------------------------+
* |[[15,["Printing Calculators"]],[13811,["Office Products"]]] |
* |[[30888,["Paper & Printable Media"]],[223845,["Office Products"]]]|
* |[[64,["Office Calculator Accessories"]]] |
* +------------------------------------------------------------------+
*
* root
* |-- bestseller_ranks: string (nullable = true)
*/
String
-> Array[Struct]
val p = df.withColumn("arr", split(
translate(
regexp_replace($"bestseller_ranks", """\]\s*,\s*\[""", "##"), "][", ""
), "##"
))
val processed = p.withColumn("bestseller_ranks_transformed", expr("TRANSFORM(arr, x -> " +
"named_struct('Ranking', cast(split(x, ',')[0] as int), 'Category', split(x, ',')[1]))"))
.select("bestseller_ranks", "bestseller_ranks_transformed")
processed.show(false)
processed.printSchema()
/**
* +------------------------------------------------------------------+-----------------------------------------------------------------+
* |bestseller_ranks |bestseller_ranks_transformed |
* +------------------------------------------------------------------+-----------------------------------------------------------------+
* |[[15,["Printing Calculators"]],[13811,["Office Products"]]] |[[15, "Printing Calculators"], [13811, "Office Products"]] |
* |[[30888,["Paper & Printable Media"]],[223845,["Office Products"]]]|[[30888, "Paper & Printable Media"], [223845, "Office Products"]]|
* |[[64,["Office Calculator Accessories"]]] |[[64, "Office Calculator Accessories"]] |
* +------------------------------------------------------------------+-----------------------------------------------------------------+
*
* root
* |-- bestseller_ranks: string (nullable = true)
* |-- bestseller_ranks_transformed: array (nullable = true)
* | |-- element: struct (containsNull = false)
* | | |-- Ranking: integer (nullable = true)
* | | |-- Category: string (nullable = true)
*/
Upvotes: 1
Reputation: 3930
Here is my solution :
val BRUDF: UserDefinedFunction =
udf(
(bestseller_ranks: String) => {
if (bestseller_ranks != null) {
bestseller_ranks.split("]],").map(v => v.replace("[","").replace("]]]","")).map(k => new BestSellerRank(k.split(",")(0).toInt,k.split(",")(1)))
}else{
null
}
}
)
Upvotes: 0