RushHour

Reputation: 613

Printing data normally without Array of arrays in spark RDD

I have the below sample of data:

67832,CLARE,MANAGER,68319,1991-06-09,2550.00,,1001
65646,JONAS,MANAGER,68319,1991-04-02,2957.00,,2001  

I want to get the data where the last column is NOT EQUAL to 2001, so I tried the below steps:

1) Loaded data in RDD:

val employeesRdd = sc.textFile("file:///home/cloudera/Desktop/Employees/employees.txt").filter(p => p != null && p.trim.length > 0)

2) Performed transformation:

If I modify my code as below, it gives an array of arrays of strings, but I want it printed in the same plain form as the input dataset.

employeesRdd.map(_.split(",")).filter(p => !(p(7) == "2001")).collect
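For reference, here is roughly what I expect the split/filter chain to do, sketched on a plain Scala List (the two hypothetical lines below stand in for my RDD contents); mkString re-joins each row so the output looks like the input again:

```scala
// Sketch on a plain List; the same chain applies to the RDD
val lines = List(
  "67832,CLARE,MANAGER,68319,1991-06-09,2550.00,,1001",
  "65646,JONAS,MANAGER,68319,1991-04-02,2957.00,,2001"
)
val kept = lines
  .map(_.split(",", -1))          // limit -1 keeps the empty commission field
  .filter(p => p(7) != "2001")    // compare with == / !=, not a single =
  .map(_.mkString(","))           // re-join so each row prints like the input
kept.foreach(println)
```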

3) I even tried mapping it to a case class, but got the output shown below:

case class employees(emp_id:java.lang.Long,emp_name:String, job_name:String,manager_id:java.lang.Long,hire_date:String,salary:java.lang.Double,commision:java.lang.Double,dep_id:java.lang.Long);

val employeesRdd1=employeesRdd.map(_.split(",")).map(p=>employees(if(p(0).length>0)p(0).toLong else 0L,p(1),p(2),if(p(3).length>0) p(3).toLong else 0L,p(4),if(p(5).length>0) p(5).toDouble else 0D, if(p(6).length>0) p(6).toDouble else 0D,if(p(7).length>0)p(7).toLong else 0L)).toDF()  

employeesRdd1.foreach(println)

SAMPLE OUTPUT OF DATA AFTER MAPPING:

employees(67832,CLARE,MANAGER,68319,1991-06-09,2550.00,,1001)
employees(65646,JONAS,MANAGER,68319,1991-04-02,2957.00,,2001)

How do I access elements in such cases? I tried the below sample code as well, but it throws an error that _1 is not a member of String:

employeesRdd1.map(_._1).first
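For what it's worth, the _1/_2 accessors exist only on tuples; a case class exposes named fields instead, as this minimal sketch (with an abridged, hypothetical Emp class) shows:

```scala
// Abridged, hypothetical case class just to contrast the two accessors
case class Emp(emp_id: Long, emp_name: String)

val e = Emp(67832L, "CLARE")
println(e.emp_id)   // case classes use field names; e._1 does not compile

val t = (67832L, "CLARE")
println(t._1)       // tuples are what define _1, _2, ...
```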

So the whole point is that I want to print the data in the same plain form as the input, but without the records whose last column equals 2001. Where am I going wrong? Or is it OK if the data is printed as an array of arrays? Is that valid from a certification perspective? Thanks in advance

Upvotes: 0

Views: 116

Answers (3)

RushHour

Reputation: 613

So finally, after a hard-fought battle, I found the solution to my problem. Below is the working code:

1) Create case class:

case class employees(emp_id:java.lang.Long,emp_name:String, job_name:String,manager_id:java.lang.Long,hire_date:String,salary:java.lang.Double,commision:java.lang.Double,dep_id:java.lang.Long);

2) Load RDD:

val rdd = sc.textFile("file:///home/cloudera/Desktop/Employees/employees.txt").filter(p => p != null && p.trim.length > 0)

3) Map case class with the RDD:

val employeesDf=rdd.map(_.split(",")).map(p=>employees(if(p(0).length>0)p(0).toLong else 0L,p(1),p(2),if(p(3).length>0) p(3).toLong else 0L,
p(4),if(p(5).length>0) p(5).toDouble else 0D, if(p(6).length>0) p(6).toDouble else 0D,if(p(7).length>0)p(7).toLong else 0L))  

4) Apply Transformation:

employeesDf.filter(_.dep_id!=2001).foreach(println)  
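If the rows should print exactly like the input file rather than as employees(...), one option (a sketch using an abridged case class, not my full eight-field one) is productIterator, which walks a case class's fields in declaration order:

```scala
// Abridged case class for the sketch; the real one has eight fields
case class employees(emp_id: Long, emp_name: String, dep_id: Long)

val e = employees(67832L, "CLARE", 1001L)
// productIterator yields the fields in order, so mkString(",")
// rebuilds a comma-separated line like the source file
val line = e.productIterator.mkString(",")
println(line)
```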

I also appreciate the efforts of the other people who tried their best to help me out. Cheers!

Upvotes: 0

Ramesh Maharjan

Reputation: 41987

For simplicity, you can just add one more && condition to your initial filter:

val employeesRdd = sc.textFile("file:///home/cloudera/Desktop/Employees/employees.txt").filter(p => p != null && p.length > 5 && !p.substring(p.length - 5).contains("2001"))
employeesRdd.foreach(println)  

will give you

67832,CLARE,MANAGER,68319,1991-06-09,2550.00,,1001

from the given input, and you don't have to go through all the case class stuff, because your final requirement is

the whole point is I want to print in a normal form like the input data but without records whose last column equals 2001

I hope the answer is helpful
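One caveat, sketched below on a plain List with a hypothetical dep_id of 12001: the substring check matches any line whose last five characters merely contain "2001", so splitting on commas and testing the last field is the stricter check.

```scala
val lines = List(
  "67832,CLARE,MANAGER,68319,1991-06-09,2550.00,,1001",
  "65646,JONAS,MANAGER,68319,1991-04-02,2957.00,,12001"  // dep_id 12001, not 2001
)
// The last five characters of the second line are "12001", which
// contains "2001", so the substring filter wrongly drops that row
val bySubstring = lines.filter(p => !p.substring(p.length - 5).contains("2001"))
// Testing the whole last field keeps dep_id 12001
val bySplit = lines.filter(p => p.split(",", -1).last != "2001")
```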

Upvotes: 1

sam.ban

Reputation: 248

You can try the below snippet:

val testRDD = spark.sparkContext.textFile("D://testsample.txt");

case class employees(emp_id: Long, emp_name: String, job_name: String, manager_id: Long, hire_date: String, salary: Double, dep_id: String);

val depRDD = testRDD.map(_.split(",")).map(p => employees(p(0).toLong, p(1), p(2), p(3).toLong, p(4), p(5).toDouble, p(6))).filter(!_.dep_id.equals("2001"));

depRDD.foreach(println)

This will give you all the rows that do not have dep_id 2001. The "!" operator in the filter function returns true when dep_id is not 2001. The sample input I considered is given below:

67832,CLARE,MANAGER,68319,1991-06-09,2550.00,1001
65646,JONAS,MANAGER,68319,1991-04-02,2957.00,2001
23459,SAMIK,MANAGER,68319,1991-08-12,2550.00,3001
67890,SUMAN,MANAGER,68319,1991-06-23,2957.00,2001
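On that sample, the end-to-end chain can be checked with a plain Scala List (a sketch mirroring the snippet above, without needing Spark):

```scala
case class employees(emp_id: Long, emp_name: String, job_name: String,
                     manager_id: Long, hire_date: String, salary: Double, dep_id: String)

val sample = List(
  "67832,CLARE,MANAGER,68319,1991-06-09,2550.00,1001",
  "65646,JONAS,MANAGER,68319,1991-04-02,2957.00,2001",
  "23459,SAMIK,MANAGER,68319,1991-08-12,2550.00,3001",
  "67890,SUMAN,MANAGER,68319,1991-06-23,2957.00,2001"
)
val kept = sample
  .map(_.split(","))
  .map(p => employees(p(0).toLong, p(1), p(2), p(3).toLong, p(4), p(5).toDouble, p(6)))
  .filter(!_.dep_id.equals("2001"))
kept.map(_.emp_id)  // the two rows whose dep_id is not 2001
```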

Upvotes: 0
