Shalini Baranwal

Reputation: 2998

Extract particular column from CSV file in scala (Spark)

I have a CSV file like:

[screenshot of the sample CSV file; "Summary" is the fourth column]

I want to extract only the "Summary" column from the above file.

I wrote this code:

val read_file2 = sc.textFile("/home/hari/sample_data_exp/extract_column_only.csv")
read_file2.collect()

val tmp1 = read_file2.map { line =>
  val parts = line.split(',')
  parts.drop(3).take(1)
}

But this gives output like:

[screenshot of the output, showing many empty "Array()" entries]

Many empty "Array()" entries appear in the result. I want only the values of the "Summary" column, with no blank arrays in between.

Upvotes: 1

Views: 12001

Answers (5)

Himaprasoon

Reputation: 2659

You can use:

val read_file2 = sc.textFile("path")

read_file2.map(_.split(",")(3)).collect

If you want to fetch column values based on column names, you can use the Databricks spark-csv library:

val df = sqlContext.read.format("csv").option("header", "true").load("pathToCsv")
df.select("columnName").collect() // here "Summary"
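
Note that collect() on a DataFrame returns Row objects rather than plain strings. A small follow-up (assuming the header names the column "Summary", as in the question) could be:

val summaries = df.select("Summary")
  .collect()           // Array[Row]
  .map(_.getString(0)) // extract the string value from each Row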

Upvotes: 0

Nicolas Rinaudo

Reputation: 6168

I'd recommend using a dedicated CSV library, since the CSV format has many surprising edge cases that a simple "read line by line, split by ," doesn't deal with.
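
For example, with this made-up row (purely for illustration), the quoted field contains a comma, and a naive split tears it apart and shifts every later index:

val line = "\"Doe, John\",2015-08-01,some-id,Great product"
line.split(',') // Array("\"Doe", " John\"", "2015-08-01", "some-id", "Great product")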

There are various quality CSV libraries - scala-csv, purecsv, jackson-csv... I'm going to recommend kantan.csv because I'm the author and feel it's a good choice, but I readily admit to being biased.

Anyway, assuming you have the kantan.csv library in your classpath, here's how to do it (assuming content is a java.io.File):

import kantan.csv.ops._

content.asUnsafeCsvReader[List[String]](',', true).collect {
  case _ :: _ :: s :: _ => s
}

This turns your file into an iterator on CSV rows, where each row is represented as a List[String], and then maps each row into the value of its third column (rows that don't have three or more columns are ignored).

Upvotes: 1

Yuval Itzchakov

Reputation: 149538

If you want only the summary values as a single flat sequence, without the intermediate arrays, use flatMap:

val summaries = file.flatMap(_.split(',').drop(3).take(1))

But looking at the CSV, you'd probably want to retrieve some kind of identifier, so maybe a Tuple2[String, String] would be better:

val idToSummary = file.map(line => {
  val parts = line.split(',')
  (parts(2), parts(3)) // (identifier, summary)
})
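
If you then need to look up a summary by its identifier, a possible follow-up (the key "id123" is just a made-up example) would be:

// lookup() is available because idToSummary is an RDD of key/value pairs
val summariesForId = idToSummary.lookup("id123") // Seq[String]

// or, for small data, bring everything to the driver as a map
val summaryMap = idToSummary.collectAsMap()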

Upvotes: 2

hasumedic

Reputation: 2167

If the file is not huge, you could load it into memory:

val tmp1 = file.map { line => line.split(',')(3) }

Or, a bit more concisely:

val tmp1 = file.map(_.split(',')(3))
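
To actually bring the values into driver memory (only sensible for small files), you could then collect:

val summaries = tmp1.collect() // Array[String] with one value per line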

Upvotes: 0

Zahiro Mor

Reputation: 1718

Try:

read_file2.map(_.split(",")).map(p => p(3)).take(100).foreach(println)

Use p(0) for the first field, p(3) for the fourth, and so on.
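
If some rows have fewer than four fields, indexing with p(3) throws an ArrayIndexOutOfBoundsException. A bounds-safe variant (a sketch, not part of the original answer) could be:

read_file2
  .flatMap(_.split(",").drop(3).headOption) // yields nothing for rows with fewer than 4 fields
  .take(100)
  .foreach(println)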

Upvotes: 0
