Manu Chadha

Reputation: 16723

Can I specify column names when creating a DataFrame?

My data is in a CSV file. The file doesn't have a header row:

United States   Romania 15
United States   Croatia 1
United States   Ireland 344
Egypt   United States   15

If I read it as-is, Spark generates names for the columns automatically:

scala> val data = spark.read.csv("./data/flight-data/csv/2015-summary.csv")
data: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 1 more field]

Is it possible to provide my own names for the columns when reading the file, if I don't want to use _c0, _c1? For example, I want Spark to use DEST, ORIG and count as the column names. I don't want to add a header row to the CSV to do this.

Upvotes: 1

Views: 1777

Answers (2)

Md Shihab Uddin

Reputation: 561

It's better to define a schema (StructType) first, then load the CSV data using that schema.

Here is how to define the schema:

import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("DEST", StringType, true),
  StructField("ORIG", StringType, true),
  StructField("count", IntegerType, true)
))

Then load the DataFrame with that schema:

val df = spark.read.schema(schema).csv("./data/flight-data/csv/2015-summary.csv")

Hopefully this helps.

Upvotes: 0

Kaushal

Reputation: 3367

Yes, you can: use the toDF function on the DataFrame to rename the columns after reading.

val data = spark.read.csv("./data/flight-data/csv/2015-summary.csv").toDF("DEST", "ORIG", "count")
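Note the difference between the two answers: toDF only renames the columns, so all three remain strings, whereas the explicit schema in the other answer also reads count as an integer. A sketch of combining toDF with a cast to get both (the file path and column names are taken from the question; assumes a local SparkSession):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rename-csv-columns")
  .getOrCreate()

// Rename the auto-generated _c0, _c1, _c2 columns, then cast count to Int.
val data = spark.read
  .csv("./data/flight-data/csv/2015-summary.csv")
  .toDF("DEST", "ORIG", "count")
  .withColumn("count", col("count").cast("int"))

data.printSchema()
```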

Upvotes: 2
