Cold Fish

Reputation: 332

Adding a list to a dataframe in Scala / Spark such that each element is added to a separate row

Say, for example, I have a dataframe in the following format (in reality it has many more documents):

df.show()

//output
    +-----+-----+-----+
    |doc_0|doc_1|doc_2|
    +-----+-----+-----+
    |  0.0|  1.0|  0.0|
    +-----+-----+-----+
    |  0.0|  1.0|  0.0|
    +-----+-----+-----+
    |  2.0|  0.0|  1.0|
    +-----+-----+-----+

// ngramShingles is a list of shingles
println(ngramShingles)

//output
    List("the",  "he ", "e l")

The length of ngramShingles equals the number of rows in the dataframe (that is, the length of each column).

How would I get to the following output?

// Desired Output
+-----+-----+-----+-------+
|doc_0|doc_1|doc_2|shingle|
+-----+-----+-----+-------+
|  0.0|  1.0|  0.0|  "the"|
+-----+-----+-----+-------+
|  0.0|  1.0|  0.0|  "he "|
+-----+-----+-----+-------+
|  2.0|  0.0|  1.0|  "e l"|
+-----+-----+-----+-------+

I have tried to add a column via the following line of code:

val finalDf = df.withColumn("shingle", typedLit(ngramShingles))

But that gives me this output:

+-----+-----+-----+-----------------------+
|doc_0|doc_1|doc_2|                shingle|
+-----+-----+-----+-----------------------+
|  0.0|  1.0|  0.0|  ("the", "he ", "e l")|
+-----+-----+-----+-----------------------+
|  0.0|  1.0|  0.0|  ("the", "he ", "e l")|
+-----+-----+-----+-----------------------+
|  2.0|  0.0|  1.0|  ("the", "he ", "e l")|
+-----+-----+-----+-----------------------+
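For context, here is a minimal sketch of why every row ends up with the whole list: typedLit builds a single literal value, so the column holds the same array of strings in each row. The object name TypedLitDemo and the helper withLiteralList are hypothetical, and the SparkSession setup and sample data are assumptions for illustration only.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.typedLit

object TypedLitDemo {
  // Hypothetical helper: adds the entire list as one literal array column.
  def withLiteralList(df: DataFrame, shingles: List[String]): DataFrame =
    df.withColumn("shingle", typedLit(shingles))

  def main(args: Array[String]): Unit = {
    // Assumed local session, for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("typedLit-demo").getOrCreate()
    import spark.implicits._

    val df = Seq((0.0, 1.0, 0.0), (0.0, 1.0, 0.0), (2.0, 0.0, 1.0))
      .toDF("doc_0", "doc_1", "doc_2")

    val finalDf = withLiteralList(df, List("the", "he ", "e l"))
    finalDf.printSchema() // the shingle column is an array of strings, not a string
    spark.stop()
  }
}
```

Because the literal is evaluated identically for every row, no variant of typedLit or lit can spread the elements across rows; some notion of row position is needed to pair row i with element i.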

I have tried a few other solutions, but nothing I have tried comes close. Basically, I just want each element of the list to be added to its own row of the DataFrame as a new column.

This question shows how to do this, but both answers rely on one column already existing. I don't think I can apply those answers to my situation, where I have thousands of columns.

Upvotes: 2

Views: 1058

Answers (1)

Krzysztof Atłasik

Reputation: 22625

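You could make a dataframe from your list and then join the two dataframes together. To do the join, you need an additional column to join on (it can be dropped afterwards). Note that monotonically_increasing_id() only guarantees unique, increasing IDs, not consecutive ones, so the two sides line up only when their rows are spread across partitions in the same way (for example, a single partition each):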

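import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._  // needed for toDF; assumes a SparkSession named spark

val listDf = List("the",  "he ", "e l").toDF("shingle")

// Tag both sides with a temporary id column, join on it, then drop it.
val result = df.withColumn("rn", monotonically_increasing_id())
   .join(listDf.withColumn("rn", monotonically_increasing_id()), "rn")
   .drop("rn")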

Result:

+-----+-----+-----+-------+
|doc_0|doc_1|doc_2|shingle|
+-----+-----+-----+-------+
|  0.0|  1.0|  0.0|    the|
|  0.0|  1.0|  0.0|    he |
|  2.0|  0.0|  1.0|    e l|
+-----+-----+-----+-------+
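If the dataframe spans several partitions, the generated IDs on the two sides may not match. One possible alternative, sketched here rather than taken from the answer above, is to build a consecutive 0-based index with RDD.zipWithIndex and join on that. The object ShingleJoin and the helpers withRowIndex and attachShingles are hypothetical names, and the sketch assumes the dataframe's current row order is the order you want to pair against.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object ShingleJoin {
  // Hypothetical helper: appends a consecutive 0-based index column via RDD.zipWithIndex.
  def withRowIndex(df: DataFrame, name: String): DataFrame = {
    val indexed = df.rdd.zipWithIndex().map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
    val schema = StructType(df.schema.fields :+ StructField(name, LongType, nullable = false))
    df.sparkSession.createDataFrame(indexed, schema)
  }

  // Hypothetical helper: pairs the i-th shingle with the i-th row of df.
  def attachShingles(df: DataFrame, shingles: Seq[String]): DataFrame = {
    val spark = df.sparkSession
    import spark.implicits._
    // Index the list on the driver so the pairing is explicit.
    val listDf = shingles.zipWithIndex.map { case (s, i) => (s, i.toLong) }.toDF("shingle", "rn")
    withRowIndex(df, "rn").join(listDf, "rn").orderBy("rn").drop("rn")
  }

  def main(args: Array[String]): Unit = {
    // Assumed local session and sample data, for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("shingles").getOrCreate()
    import spark.implicits._
    val df = Seq((0.0, 1.0, 0.0), (0.0, 1.0, 0.0), (2.0, 0.0, 1.0))
      .toDF("doc_0", "doc_1", "doc_2")
    attachShingles(df, List("the", "he ", "e l")).show()
    spark.stop()
  }
}
```

The round trip through an RDD costs more than monotonically_increasing_id(), so this trade-off is probably only worth making when the data is not guaranteed to sit in a single partition.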

Upvotes: 1
