Reputation: 181
I need to merge multiple columns of a DataFrame into a single column whose value is a list (or tuple), using PySpark.
Input dataframe:
+------+-------+-------+-------+-------+
| name | mark1 | mark2 | mark3 | Grade |
+------+-------+-------+-------+-------+
| Jim  | 20    | 30    | 40    | "C"   |
+------+-------+-------+-------+-------+
| Bill | 30    | 35    | 45    | "A"   |
+------+-------+-------+-------+-------+
| Kim  | 25    | 36    | 42    | "B"   |
+------+-------+-------+-------+-------+
The output DataFrame should be:
+------+------------------+
| name | marks            |
+------+------------------+
| Jim  | [20,30,40,"C"]   |
+------+------------------+
| Bill | [30,35,45,"A"]   |
+------+------------------+
| Kim  | [25,36,42,"B"]   |
+------+------------------+
Upvotes: 16
Views: 40348
Reputation: 2085
You can do it in a select like the following. Note that concat produces a single comma-separated string, not a list:
from pyspark.sql.functions import col, concat, lit

# Join the four columns with "," separators into one column named 'marks'.
df.select(
    'name',
    concat(
        col("mark1"), lit(","),
        col("mark2"), lit(","),
        col("mark3"), lit(","),
        col("Grade")
    ).alias('marks')
)
If the [ ] brackets are necessary, they can be added with the lit function:
from pyspark.sql.functions import col, concat, lit

# Same as before, with literal "[" and "]" wrapped around the joined values.
df.select(
    'name',
    concat(
        lit("["),
        col("mark1"), lit(","),
        col("mark2"), lit(","),
        col("mark3"), lit(","),
        col("Grade"), lit("]")
    ).alias('marks')
)
Upvotes: 1
Reputation: 461
Columns can be merged with Spark's array function:
import pyspark.sql.functions as f
columns = [f.col("mark1"), ...]
output = input.withColumn("marks", f.array(columns)).select("name", "marks")
You might need to change the type of the entries for the merge to succeed, because all elements of an array column share a single type.
Upvotes: 26
Reputation: 145
If this is still relevant, you can use StringIndexer to encode your string values as float substitutes.
Upvotes: 0
Reputation: 160
Take a look at the VectorAssembler documentation: https://spark.apache.org/docs/2.1.0/ml-features.html#vectorassembler
from pyspark.ml.feature import VectorAssembler

# Combine the numeric mark columns into a single vector column named "marks".
assembler = VectorAssembler(
    inputCols=["mark1", "mark2", "mark3"],
    outputCol="marks")

# `dataset` is the input DataFrame.
output = assembler.transform(dataset)
output.select("name", "marks").show(truncate=False)
Upvotes: 3