Reputation: 2998
I have a CSV file like:
I want to extract only the "Summary" column from the above file.
I wrote this code:
val read_file2 = sc.textFile("/home/hari/sample_data_exp/extract_column_only.csv")
read_file2.collect()
val tmp1 = read_file2.map { line =>
  val parts = line.split(',')
  parts.drop(3).take(1)
}
But this gives output like:
Many "Array()" entries appear. I want only the values of the "Summary" column, with no blank arrays in between.
Upvotes: 1
Views: 12001
Reputation: 2659
You can use:
val read_file2 = sc.textFile("path")
read_file2.map(_.split(",")(3)).collect
If you want to fetch column values by column name, you can use the Databricks spark-csv library:
val df = sqlContext.read.format("csv").option("header", "true").load("pathToCsv")
df.select("columnName").collect() // here "Summary"
Upvotes: 0
Reputation: 6168
I'd recommend using a dedicated CSV library, since the CSV format has many surprising edge cases that a simple "read line by line, split on commas" approach doesn't handle.
There are various quality CSV libraries - scala-csv, purecsv, jackson-csv... I'm going to recommend kantan.csv because I'm the author and feel it's a good choice, but I readily admit to being biased.
Anyway, assuming you have the kantan.csv library on your classpath, here's how to do it (assuming content is a java.io.File):
import kantan.csv.ops._
content.asUnsafeCsvReader[List[String]](',', true).collect {
  case _ :: _ :: _ :: s :: _ => s
}
This turns your file into an iterator over CSV rows, where each row is represented as a List[String], and then maps each row to the value of its fourth column, "Summary" (rows that don't have at least four columns are ignored).
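The same collect-with-a-partial-function idea can be sketched in plain Scala, without kantan.csv (the sample rows below are made up):

```scala
// Rows shorter than four fields don't match the pattern
// and are silently skipped by collect.
val rows: List[List[String]] = List(
  List("1", "a", "id1", "first summary"),
  List("2", "b"), // too short, dropped
  List("3", "c", "id3", "second summary")
)
val summaries = rows.collect {
  case _ :: _ :: _ :: s :: _ => s
}
println(summaries) // List(first summary, second summary)
```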
Upvotes: 1
Reputation: 149538
If you want only the summary part, without intermediate arrays but as a single flat sequence, use flatMap. Note that flatMapping a plain String flattens it into characters, so return an Option instead; lift(3) yields None for rows with fewer than four fields, and flatMap drops those:
val summaries = file.flatMap(_.split(',').lift(3))
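A plain-Scala sketch of the flatMap approach; using lift(3) rather than a direct index makes short rows vanish instead of throwing (the sample lines are made up):

```scala
// lift(3) returns Some(fourthField), or None when the row has
// fewer than four fields; flatMap flattens the Options away.
val lines = List("a,b,c,keep me", "x,y", "d,e,f,and me")
val summaries = lines.flatMap(_.split(',').lift(3))
println(summaries) // List(keep me, and me)
```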
But looking at the CSV, you'd probably want to retrieve some kind of identifier as well, so a Tuple2[String, String] might be better:
val idToSummary = file.map(line => {
  val parts = line.split(',')
  (parts(2), parts(3))
})
Upvotes: 2
Reputation: 2167
If the file is not huge, you could load it into memory:
val tmp1 = file.map { line => line.split(',')(3) }
Or, a bit more concisely:
val tmp1 = file.map(_.split(',')(3))
Upvotes: 0
Reputation: 1718
Try:
read_file2.map(_.split(",")).map(p => p(3)).take(100).foreach(println)
Use p(0) for the first field, p(3) for the fourth one, and so on.
Upvotes: 0