Reputation: 332
Say, for example, I have a DataFrame in the following format (in reality there are many more documents):
df.show()
//output
+-----+-----+-----+
|doc_0|doc_1|doc_2|
+-----+-----+-----+
| 0.0| 1.0| 0.0|
+-----+-----+-----+
| 0.0| 1.0| 0.0|
+-----+-----+-----+
| 2.0| 0.0| 1.0|
+-----+-----+-----+
// ngramShingles is a list of shingles
println(ngramShingles)
//output
List("the", "he ", "e l")
where the length of ngramShingles is equal to the length of the DataFrame's columns (that is, the number of rows).
How would I get to the following output?
// Desired Output
+-----+-----+-----+-------+
|doc_0|doc_1|doc_2|shingle|
+-----+-----+-----+-------+
| 0.0| 1.0| 0.0| "the"|
+-----+-----+-----+-------+
| 0.0| 1.0| 0.0| "he "|
+-----+-----+-----+-------+
| 2.0| 0.0| 1.0| "e l"|
+-----+-----+-----+-------+
I have tried to add a column via the following line of code:
val finalDf = df.withColumn("shingle", typedLit(ngramShingles))
But that gives me this output:
+-----+-----+-----+-----------------------+
|doc_0|doc_1|doc_2| shingle|
+-----+-----+-----+-----------------------+
| 0.0| 1.0| 0.0| ("the", "he ", "e l")|
+-----+-----+-----+-----------------------+
| 0.0| 1.0| 0.0| ("the", "he ", "e l")|
+-----+-----+-----+-----------------------+
| 2.0| 0.0| 1.0| ("the", "he ", "e l")|
+-----+-----+-----+-----------------------+
I have tried a few other solutions, but nothing has come close. Basically, I just want each element of the list to be added as the value of a new column on the corresponding row of the DataFrame.
This question shows how to do this, but both answers rely on one column already existing. I don't think I can apply those answers to my situation, where I have thousands of columns.
Upvotes: 2
Views: 1058
Reputation: 22625
You could make a DataFrame from your list and then join the two DataFrames together. To do the join, you'd need to add an additional column to each one to use as the join key (it can be dropped afterwards):
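import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark
val listDf = List("the", "he ", "e l").toDF("shingle")
// "rn" is a temporary join key; the ids line up only when both DataFrames have the same number of rows in each partition
val result = df.withColumn("rn", monotonically_increasing_id())
  .join(listDf.withColumn("rn", monotonically_increasing_id()), "rn")
  .drop("rn")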
Result:
+-----+-----+-----+-------+
|doc_0|doc_1|doc_2|shingle|
+-----+-----+-----+-------+
| 0.0| 1.0| 0.0| the|
| 0.0| 1.0| 0.0| he |
| 2.0| 0.0| 1.0| e l|
+-----+-----+-----+-------+
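One caveat with the join above: monotonically_increasing_id() only guarantees unique, increasing ids, not consecutive ones, so the two "rn" columns can disagree once the DataFrames are split across partitions differently. A minimal sketch of an alternative that pairs rows with list elements by position via RDD.zipWithIndex follows. The object name AppendShingles and the helper appendColumn are hypothetical, and the sketch assumes the list fits on the driver (it already does, since it is a plain List) and that df has a stable row order.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object AppendShingles {
  // Hypothetical helper: appends `values` as a new string column, matching by row position.
  def appendColumn(spark: SparkSession, df: DataFrame, name: String, values: Seq[String]): DataFrame = {
    // Vector gives fast indexed lookup inside the closure shipped to executors.
    val lookup = values.toVector
    // zipWithIndex assigns consecutive 0-based positions following partition order.
    val rows = df.rdd.zipWithIndex.map { case (row, idx) =>
      Row.fromSeq(row.toSeq :+ lookup(idx.toInt))
    }
    // Original schema plus one extra string field for the new column.
    val schema = StructType(df.schema.fields :+ StructField(name, StringType, nullable = true))
    spark.createDataFrame(rows, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("append-shingles").getOrCreate()
    import spark.implicits._

    val df = Seq((0.0, 1.0, 0.0), (0.0, 1.0, 0.0), (2.0, 0.0, 1.0)).toDF("doc_0", "doc_1", "doc_2")
    val ngramShingles = List("the", "he ", "e l")

    appendColumn(spark, df, "shingle", ngramShingles).show()
    spark.stop()
  }
}
```

Because the helper works on whole Row objects, it does not depend on the number or names of the existing columns, so it should carry over to a DataFrame with thousands of doc_N columns. It will throw if the list is shorter than the number of rows.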
Upvotes: 1