Reputation: 509
Suppose this is my CSV file:
21628000000;21650466094
21697098269;21653506459
21653000000;21624124815
21624124815;21650466094
21650466094;21650466094
21624124815;21697098269
21697098269;21628206459
21628000000;21624124815
21650466094;21628206459
21628000000;21628206459
I want to count the number of occurrences of each value in the first column, to get a result like:
(21628000000,4)
(21697098269,2)
(21624124815,2)
(21650466094,2)
I've tried:
import org.apache.spark.{SparkConf, SparkContext}

object CountOcc {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("Word Count")
      .setMaster("local")
    val sc = new SparkContext(conf)
    // load the text file into an RDD[String]
    val textFile = sc.textFile(args(0))
    // read each line and split it into fields
    val words = textFile.flatMap(line => line.split(";"))
    val cols = words.map(_.trim)
    println(s"${cols(0)}") // error
    cols.foreach(println)
    sc.stop()
  }
}
I get the error org.apache.spark.rdd.RDD[String] does not take parameters.
So I can't call cols(0) or cols(1). How can I get only the first column so that I can count the occurrences?
Upvotes: 0
Views: 3580
Reputation: 1147
This Scala job will correctly print the first column of the CSV file.
import org.apache.spark.sql.SparkSession

object CountOcc {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .appName("Read CSV")
      .getOrCreate()
    import spark.implicits._ // encoder needed for the Dataset map below

    // the file is semicolon-separated, so set the separator explicitly
    val csvDF = spark.read.option("sep", ";").csv(args(0))
    val firstColumnList = csvDF.map(x => x.getString(0))
    firstColumnList.foreach(println(_))
    spark.close
  }
}
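To go from the first column to the (value, count) pairs asked for in the question, one possible follow-up, assuming the same csvDF as above ("_c0" is the default name Spark gives the first column when the file has no header), is:

val counts = csvDF.groupBy("_c0").count() // one row per distinct value with its occurrence count
counts.show()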
Hope it helps
Upvotes: 0
Reputation: 1718
Try:
val words = textFile.map (line => line.split(";")(0)).map(p=>(p,1)).reduceByKey(_+_).collect()
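The same chain spread over several lines with comments, purely as a readability sketch of the line above (assuming textFile is the RDD from the question):

val counts = textFile
  .map(line => line.split(";")(0)) // keep only the first field of each line
  .map(p => (p, 1))                // pair every key with a 1
  .reduceByKey(_ + _)              // sum the 1s per key, giving (value, count)
  .collect()                       // bring the resulting pairs back to the driver
counts.foreach(println)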
Upvotes: 4
Reputation: 509
I tried:
val words = textFile.flatMap (line => line.split(";")(1))
I get:
2
1
6
5
0
4
6
6
0
9
4
2
1
6
5
3
5
0
6
4
5
9
2
1
6.....
Upvotes: 0
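That output is the second column flattened into single characters: flatMap treats the returned String as a collection of Char, so every digit becomes its own RDD element. Using map instead keeps each field whole; a minimal sketch, assuming the same textFile as before:

val secondCol = textFile.map(line => line.split(";")(1)) // one whole value per element
secondCol.foreach(println)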