Reputation: 47
I have a csv file, as below
It has 6 rows with top row as header, while header read as "Students Marks" dataframe is treating them as one columns, now i want to separate both columns with data. "student" and "marks" are separated by space.
df.show()
_______________
##Student Marks##
---------------
A 10;20;10;20
A 20;20;30;10
B 10;10;10;10
B 20;20;20;10
B 30;30;30;20
Now i want to transform this csv table into two columns, with student and Marks, Also for every student the marks with add up, something like below
Student | Marks
A | 30;40;40;30
B | 60;60;60;40
I have tried with below but it is throwing an error
df.withColumn("_tmp", split($"Students Marks","\\ ")).select($"_tmp".getItem(0).as("col1"),$"_tmp".getItem(1).as("col2")).drop("_tmp")
Upvotes: 1
Views: 455
Reputation: 23099
You can read the csv file with the delimiteryou want and calculate result as below
val df = spark.read
.option("header", true)
.option("delimiter", " ")
.csv("path to csv")
After You get the dataframe df
val resultDF = df.withColumn("split", split($"Marks", ";"))
.withColumn("a", $"split"(0))
.withColumn("b", $"split"(1))
.withColumn("c", $"split"(2))
.withColumn("d", $"split"(3))
.groupBy("Student")
.agg(concat_ws(";", array(
Seq(sum($"a"), sum($"b"), sum($"c"), sum($"d")): _*)
).as("Marks"))
resultDF.show(false)
Output:
+-------+-------------------+
|Student|Marks |
+-------+-------------------+
|B |60.0;60.0;60.0;40.0|
|A |30.0;40.0;40.0;30.0|
+-------+-------------------+
Upvotes: 2
Reputation: 710
Three Ideas. The first one is to read the file, split it by space and then create the dataFrame:
val df = sqlContext.read
.format("csv")
.option("header", "true")
.option("delimiter", " ")
.load("your_file.csv")
The second one is to read the file to dataframe and split it:
df.withColumn("Student", split($"Students Marks"," ").getItem(0))
.withColumn("Marks", split($"Students Marks"," ").getItem(1))
.drop("Students Marks")
The last one is your solution. It should work, but when you use the select, you don't use $"_tmp", therefore, it should work without the .drop("_tmp")
df.withColumn("_tmp", split($"Students Marks"," "))
.select($"_tmp".getItem(0).as("Student"),$"_tmp".getItem(1).as("Marks"))
Upvotes: 1