screechOwl

Reputation: 28129

Spark Rename Dataframe Columns

I have two files in HDFS: one is a CSV file with no header, and the other is a list of column names. Is it possible to assign those column names to the data frame without typing them out by hand, as described here?

I'm looking for something like this:

val df = sqlContext.read.format("com.databricks.spark.csv").option("delimiter", "\t").load("/user/training_data.txt")
val header = sqlContext.read.format("com.databricks.spark.csv").option("delimiter", ",").load("/user/col_names.txt")

df.columns(header)

Is this possible?

Upvotes: 1

Views: 1331

Answers (1)

Alexey Svyatkovskiy

Reputation: 646

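One way is to read the header file with scala.io.Source, like this (note that Source.fromFile reads from the local filesystem, so this assumes the header file is available locally and not only in HDFS):

import scala.io.Source
// Split each line of the header file on commas into an array of column names.
val header = Source.fromFile("/user/col_names.txt").getLines().map(_.split(","))
// The first line holds the new column names.
val newNames = header.next()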
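The snippet above leaves the file handle open and keeps any whitespace around the names. A minimal sketch of a more defensive variant follows, assuming the header file is on the local filesystem; the helper name readHeader and the temporary file are hypothetical and only there to make the example self-contained.

```scala
import java.nio.file.Files
import scala.io.Source

// Hypothetical helper: read the first line of a local file and split it into column names.
def readHeader(path: String, sep: String = ","): Array[String] = {
  val source = Source.fromFile(path)
  try {
    // Take only the first line and trim stray whitespace around each name.
    source.getLines().next().split(sep).map(_.trim)
  } finally {
    // Always release the file handle, even if the file is empty and next() throws.
    source.close()
  }
}

// Usage: a temporary file stands in for /user/col_names.txt.
val tmp = Files.createTempFile("col_names", ".txt")
Files.write(tmp, "id, name,score\n".getBytes("UTF-8"))
val newNames = readHeader(tmp.toString)
```

The resulting array can be passed to toDF in the same way as in the answer.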

Then read the CSV file with spark-csv as you already do, specifying that it has no header, and rename the columns with toDF:

// spark is the SparkSession (Spark 2.x); on Spark 1.x use sqlContext.read as in the question.
val df = spark.read.format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", "\t")
  .load("/user/training_data.txt")
  .toDF(newNames: _*)

Note the : _* after newNames.

This is a type ascription in Scala. It tells the compiler to expand the sequence into a variable-length argument list, so each element of newNames is passed to toDF as a separate column name. The number of names must match the number of columns in the data frame.

More here: What is the purpose of type ascriptions in Scala?
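To see what : _* does without involving Spark, here is a small sketch in plain Scala. The method describe is a hypothetical stand-in for any varargs method such as toDF(colNames: String*); it is not part of Spark.

```scala
// Hypothetical stand-in for a varargs method such as DataFrame.toDF(colNames: String*).
def describe(colNames: String*): String =
  colNames.zipWithIndex.map { case (name, i) => s"_c$i -> $name" }.mkString(", ")

val names = Array("id", "name", "score")

// Passing the elements one by one...
val explicit = describe("id", "name", "score")

// ...is equivalent to expanding the array with : _*
val expanded = describe(names: _*)
```

Both calls return the same string. Calling describe(names) without the ascription would not compile, because an Array[String] is not a String.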

Upvotes: 2
