Reputation: 33
Given:
I read every column from the dataframe and call the function with the column as a parameter.
The output should be saved as a table. How can I achieve this?
Upvotes: 0
Views: 292
Reputation: 7207
If the function returns values of the same type, here is one way to do it in Scala:
// functions (the extra Array wrapper makes explode yield the inner split array)
import org.apache.spark.sql.functions.{col, explode, udf}
val mySplit = (value: String) => Array(value.split(","))
val mySplitUDF = udf(mySplit(_: String))
// data
val initialDF = sparkContext.parallelize(List("First,Second,Third")).toDF("initialColumn")
// transformations
val arrayDF = initialDF.select(mySplitUDF(col("initialColumn")).as("arrayColumn"))
val explodedDF = arrayDF.select(explode(col("arrayColumn")).as("explodedCol"))
val resultDF = explodedDF.select(
  col("explodedCol").getItem(0).as("Col1"),
  col("explodedCol").getItem(1).as("Col2"),
  col("explodedCol").getItem(2).as("Col3")
)
resultDF.show(false)
Result is:
+-----+------+-----+
|Col1 |Col2 |Col3 |
+-----+------+-----+
|First|Second|Third|
+-----+------+-----+
In Python this can be implemented in a similar way.
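As a rough pure-Python sketch of the same idea (no Spark; `my_split`, `parts`, and `row` are hypothetical names introduced here), `split` produces a list and `getItem(i)` corresponds to indexing into it:

```python
# Hypothetical pure-Python analogue of the Scala UDF above (no Spark).
def my_split(value):
    # same operation as value.split(",") inside the UDF
    return value.split(",")

parts = my_split("First,Second,Third")
# getItem(0..2) in the Scala code corresponds to list indexing here
row = {"Col1": parts[0], "Col2": parts[1], "Col3": parts[2]}
print(row)
```

In Spark itself the same shape is available through `pyspark.sql.functions.split` and `explode`.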
Upvotes: 1
Reputation: 3619
from pyspark.sql import Row
df = sc.parallelize(['a','b','c']).map(lambda row : Row(key=row)).toDF()
df.show()
+---+
|key|
+---+
| a|
| b|
| c|
+---+
Now define a function that maps each input row to a new Row:
def func(args):
    # build one Row whose 'result' field joins five derived values
    lista = Row(result=",".join([args.key + str(x) for x in range(5)]))
    return lista
new_table = df.rdd.map(func).toDF()
new_table.show()
+--------------+
| result|
+--------------+
|a0,a1,a2,a3,a4|
|b0,b1,b2,b3,b4|
|c0,c1,c2,c3,c4|
+--------------+
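The string-building core of `func` can be checked without Spark; a minimal sketch (the `key` value here is just sample data matching the first row above):

```python
# What func computes for one row, without Spark:
# join key + index for indices 0..4 into one comma-separated string
key = "a"
result = ",".join(key + str(x) for x in range(5))
print(result)  # a0,a1,a2,a3,a4
```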
Finally, save the result as a table:
new_table.write.saveAsTable("results")
Upvotes: 1