Erik Barajas

Reputation: 193

Create new DataFrame with new rows depending on the value of a column - Spark Scala

I have a DataFrame with the following data:

  num_cta   | n_lines
110000000000|   2
110100000000|   3
110200000000|   1

With that information, I need to create a new DF with a different number of rows per record, depending on the value in the n_lines column.

For example, for the first row of my DF (110000000000), the value of the n_lines column is 2. The result would have to be something like the following:

  num_cta   
110000000000
110000000000

For the whole example DataFrame shown above, the expected result would be something like this:

  num_cta  
110000000000
110000000000
110100000000
110100000000
110100000000
110200000000

Is there a way to do that, i.e. multiply a row n times depending on the value of a column?

Regards.

Upvotes: 0

Views: 1080

Answers (2)

Leo C

Reputation: 22439

One approach would be to expand n_lines into an array with a UDF and then explode the array:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("110000000000", 2),
  ("110100000000", 3),
  ("110200000000", 1)
).toDF("num_cta", "n_lines")

// UDF that builds an array of length n (the contents don't matter, only the size)
def fillArr = udf(
  (n: Int) => Array.fill(n)(1)
)

// Explode the generated array so that each num_cta is repeated n_lines times
val df2 = df.withColumn("arr", fillArr($"n_lines")).
  withColumn("a", explode($"arr")).
  select($"num_cta")

df2.show
+------------+
|     num_cta|
+------------+
|110000000000|
|110000000000|
|110100000000|
|110100000000|
|110100000000|
|110200000000|
+------------+
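As a side note, on Spark 2.4 or later the same result can be obtained without a UDF by combining the built-in array_repeat and explode functions. A minimal sketch, assuming the same df as above:

// UDF-free variant (requires Spark 2.4+ for array_repeat)
val df3 = df.
  withColumn("a", explode(array_repeat(lit(1), $"n_lines"))).
  select($"num_cta")

df3.show  // same output as df2 above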

Upvotes: 1

Avishek Bhattacharya

Reputation: 6974

There is no off-the-shelf way of doing this. However, you can iterate over the DataFrame and, for each row, return a list of num_cta values whose length equals the corresponding n_lines.

Something like

import spark.implicits._

case class Output(num_cta: String)               // output dataframe schema
case class Input(num_cta: String, n_lines: Int)  // input dataframe 'df' schema

// Convert to a typed Dataset, then emit each num_cta n_lines times
val result = df.as[Input].flatMap(x => {
  List.fill(x.n_lines)(Output(x.num_cta))
}).toDF
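Here Input and Output are just placeholder names for the input and output schemas (the original snippet left the case classes unnamed), and the flatMap runs over the typed Dataset view of df. A quick sanity check:

// Should print the same six replicated num_cta rows as the expected output in the question
result.show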

Upvotes: 1
