Reputation: 509
Suppose this is my CSV file:
21628000000;21650466094
21697098269;21653506459
21653000000;21624124815
21624124815;21650466094
21650466094;21650466094
21624124815;21697098269
21697098269;21628206459
21628000000;21624124815
21650466094;21628206459
21628000000;21628206459
I want to count the number of occurrences of each value in the first column, to get a result like:
(21628000000,4)
(21697098269,2)
(21624124815,2)
(21650466094,2)
I've tried:
import org.apache.spark.{SparkConf, SparkContext}

object CountOcc {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("Word Count")
      .setMaster("local")
    val sc = new SparkContext(conf)
    // load the text file into an RDD[String]
    val textFile = sc.textFile(args(0))
    // read each line and split it into fields
    val words = textFile.flatMap(line => line.split(";"))
    val cols = words.map(_.trim)
    println(s"${cols(0)}") // error
    cols.foreach(println)
    sc.stop()
  }
}
I get the error org.apache.spark.rdd.RDD[String] does not take parameters.
So I can't call cols(0) or cols(1). How can I get only the first column so that I can count the occurrences?
Upvotes: 0
Views: 3580
Reputation: 1147
This Scala job will correctly print the first column of the CSV file.
import org.apache.spark.sql.SparkSession

object CountOcc {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .appName("Read CSV")
      .getOrCreate()
    import spark.implicits._ // encoder needed for the Dataset map below

    // the file is semicolon-separated, so set the separator explicitly
    val csvDF = spark.read.option("sep", ";").csv(args(0))
    val firstColumnList = csvDF.map(x => x.getString(0))
    firstColumnList.foreach(println(_))
    spark.close
  }
}
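To go from the first column to the (value, count) pairs asked for in the question, one possible follow-up, assuming the same csvDF as above ("_c0" is the default name Spark gives the first column when the file has no header), is:

val counts = csvDF.groupBy("_c0").count() // one row per distinct value with its occurrence count
counts.show()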
Hope it helps
Upvotes: 0
Reputation: 1718
Try:
val words = textFile.map (line => line.split(";")(0)).map(p=>(p,1)).reduceByKey(_+_).collect()
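The same chain spread over several lines with comments, purely as a readability sketch of the line above (assuming textFile is the RDD from the question):

val counts = textFile
  .map(line => line.split(";")(0)) // keep only the first field of each line
  .map(p => (p, 1))                // pair every key with a 1
  .reduceByKey(_ + _)              // sum the 1s per key, giving (value, count)
  .collect()                       // bring the resulting pairs back to the driver
counts.foreach(println)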
Upvotes: 4
Reputation: 509
I tried:
val words = textFile.flatMap (line => line.split(";")(1))
I get:
2
1
6
5
0
4
6
6
0
9
4
2
1
6
5
3
5
0
6
4
5
9
2
1
6.....
Upvotes: 0
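That output is the second column flattened into single characters: flatMap treats the returned String as a collection of Char, so every digit becomes its own RDD element. Using map instead keeps each field whole; a minimal sketch, assuming the same textFile as before:

val secondCol = textFile.map(line => line.split(";")(1)) // one whole value per element
secondCol.foreach(println)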