Michael Discenza

Reputation: 3340

Reading TSV into Spark Dataframe with Scala API

I have been trying to get the Databricks library for reading CSVs to work. I am trying to read a TSV created by Hive into a Spark DataFrame using the Scala API.

Here is an example that you can run in the spark shell (I made the sample data public so it works for you):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val sqlContext = new SQLContext(sc)
val segments = sqlContext.read.format("com.databricks.spark.csv").load("s3n://michaeldiscenza/data/test_segments")

The documentation says you can specify the delimiter, but I am unclear about how to set that option.

Upvotes: 31

Views: 63203

Answers (3)

You may also try inferSchema and then check the schema:

val df = spark.read.format("csv")
  .option("inferSchema", "true")  // infer column types (costs an extra pass over the data)
  .option("sep", "\t")            // tab-separated values
  .option("header", "true")       // first line holds the column names
  .load(tmp_loc)

df.printSchema()
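
If you already know the column types, an alternative is to pass an explicit schema and skip the inference pass. A minimal sketch, where the field names and types are made-up assumptions and tmp_loc is the same placeholder path as above:

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Hypothetical schema -- replace the field names/types with your file's actual columns.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("segment", StringType, nullable = true)
))

val df = spark.read.format("csv")
  .option("sep", "\t")
  .option("header", "true")
  .schema(schema)   // explicit schema, so no inference pass is needed
  .load(tmp_loc)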

Upvotes: 0

Shaido

Reputation: 28422

With Spark 2.0+, use the built-in CSV connector to avoid the third-party dependency and get better performance:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
val segments = spark.read.option("sep", "\t").csv("/path/to/file")
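
The built-in reader takes the same kinds of options as the spark-csv package. A minimal sketch, assuming your file has a header row (the path and option values are assumptions about your data):

val segments = spark.read
  .option("sep", "\t")            // tab delimiter
  .option("header", "true")       // first line holds the column names
  .option("inferSchema", "true")  // let Spark guess the column types
  .csv("/path/to/file")

segments.show(5)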

Upvotes: 36

Michael Discenza

Reputation: 3340

All of the option parameters are passed via the option() function, as shown below:

val segments = sqlContext.read.format("com.databricks.spark.csv")
    .option("delimiter", "\t")
    .load("s3n://michaeldiscenza/data/test_segments")

Upvotes: 40
