Ricky

Reputation: 2750

How to filter data based on three columns in Scala

I am a newbie to Scala and I want to iterate through three for loops over one data set and perform some analysis. For example, my data is like below:

Sample.csv

1,100,0,NA,0,1,0,Friday,1,5
1,100,0,NA,0,1,0,Wednesday,1,9
1,100,1,NA,0,1,0,Friday,1,5
1,100,2,NA,0,1,0,Friday,1,5
1,101,0,NA,0,1,0,Friday,1,5
1,101,1,NA,0,1,0,Friday,1,5
1,101,2,NA,0,1,0,Friday,1,5
1,102,0,NA,0,1,0,Friday,1,5
1,102,1,NA,0,1,0,Friday,1,5
1,102,2,NA,0,1,0,Friday,1,5

So now I read it like below:

val data = sc.textFile("C:/users/ricky/Data.csv")

Now I need to implement a filter on the first three columns in Scala to get a subset of the whole data and do some analysis. For example, the first three columns are the ones to be filtered. So I have one value for the first column (1), 3 values for the second column (100, 101, 102) and 3 values for the third column (0, 1, 2). So now I need to run the filter to produce each subset of the whole data. Is it good to use loops like below?

for {
  i <- Seq(1)
  j <- 100 to 102
  k <- 0 to 2
}

which should give me subset data like

1,100,0,NA,0,1,0,Friday,1,5
1,100,0,NA,0,1,0,Wednesday,1,9

where i = 1, j = 100, and k = 0

and up to

1,102,2,NA,0,1,0,Friday,1,5

where i = 1, j = 102, and k = 2

How can I run this filter over the data (which I read from the CSV) in Scala?

Upvotes: 0

Views: 3121

Answers (1)

Ramesh Maharjan

Reputation: 41987

After you read it from the text CSV file, you can use filter to keep only the data you want:

val tempData = data.map(line => line.split(","))   // split each line into its columns
tempData.filter(array => array(0) == "1" && array(1) == "100" && array(2) == "0").foreach(x => println(x.mkString(",")))

This will give you the result:

1,100,0,NA,0,1,0,Friday,1,5
1,100,0,NA,0,1,0,Wednesday,1,9

You can do the same for the rest of the cases, or loop over the combinations as sketched below.
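If you do not want to hard-code each combination, you can loop over the values of the three columns and filter inside the loop. A minimal sketch, assuming the values 1, 100 to 102 and 0 to 2 from your question:

for {
  i <- Seq("1")
  j <- (100 to 102).map(_.toString)
  k <- (0 to 2).map(_.toString)
} {
  // each subset corresponds to one (i, j, k) combination
  val subset = tempData.filter(array => array(0) == i && array(1) == j && array(2) == k)
  subset.foreach(x => println(x.mkString(",")))   // replace the println with your analysis
}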

DataFrame APIs

You can use the DataFrame API for simplicity; it is more optimized than the RDD API and offers much more. The first step would be to read the CSV into a dataframe:

val df = sqlContext.read.format("com.databricks.spark.csv").load("path to csv file")

you will have

+---+---+---+---+---+---+---+---------+---+---+
|_c0|_c1|_c2|_c3|_c4|_c5|_c6|_c7      |_c8|_c9|
+---+---+---+---+---+---+---+---------+---+---+
|1  |100|0  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |100|0  |NA |0  |1  |0  |Wednesday|1  |9  |
|1  |100|1  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |100|2  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |101|0  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |101|1  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |101|2  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |102|0  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |102|1  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |102|2  |NA |0  |1  |0  |Friday   |1  |5  |
+---+---+---+---+---+---+---+---------+---+---+

Then you can use the filter API just as with the RDD:

import sqlContext.implicits._
val df1 = df.filter($"_c0" === "1" && $"_c1" === "100" && $"_c2" === "0")

you should have

+---+---+---+---+---+---+---+---------+---+---+
|_c0|_c1|_c2|_c3|_c4|_c5|_c6|_c7      |_c8|_c9|
+---+---+---+---+---+---+---+---------+---+---+
|1  |100|0  |NA |0  |1  |0  |Friday   |1  |5  |
|1  |100|0  |NA |0  |1  |0  |Wednesday|1  |9  |
+---+---+---+---+---+---+---+---------+---+---+

You can even define a schema to give the columns the names you want.
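For example, a minimal sketch (the column names below are just assumptions for illustration, pick whatever fits your data):

import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

// assumed column names, purely illustrative
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("group", IntegerType),
  StructField("index", IntegerType),
  StructField("flag", StringType),
  StructField("c4", IntegerType),
  StructField("c5", IntegerType),
  StructField("c6", IntegerType),
  StructField("day", StringType),
  StructField("c8", IntegerType),
  StructField("c9", IntegerType)
))

val dfWithSchema = sqlContext.read
  .format("com.databricks.spark.csv")
  .schema(schema)
  .load("path to csv file")

// now the filter can use the real column names and types
val filtered = dfWithSchema.filter($"id" === 1 && $"group" === 100 && $"index" === 0)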

Edited

Answering your comment below, it all depends on what you output:

scala> val temp = tempData.filter(array => array(0) == "1" && array(1).toInt == 100 && array(2).toInt == 0).map(x => x.mkString(","))
temp: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at map at <console>:28

scala> tempData.filter(array => array(0) == "1" && array(1).toInt == 100 && array(2).toInt == 0)
res9: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[13] at filter at <console>:29
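The first gives you back an RDD[String] (each row joined with commas again), while the second is still an RDD[Array[String]], so you keep access to the individual columns. A small illustrative sketch of using each:

temp.collect().foreach(println)   // prints e.g. "1,100,0,NA,0,1,0,Friday,1,5"
tempData.filter(array => array(0) == "1" && array(1).toInt == 100 && array(2).toInt == 0)
  .collect()
  .foreach(array => println(array(7)))   // individual fields are still addressable, e.g. the day column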

I hope it's clear.

Upvotes: 2
