Reputation: 165
I am developing Spark using Scala, and I don't have any background of Scala. I don't get the ValueError Yet, but I am preparing the ValueError Handler for my code.
|location|arrDate|deptDate|
|JFK |1201 |1209 |
|LAX |1208 |1212 |
|NYC | |1209 |
|22 |1201 |1209 |
|SFO |1202 |1209 |
If we have data like this, I would like to store Third row and Fourth row into Error.dat then process the fifth row again. In the error log, I would like to put the information of the data such as which file, the number of the row, and details of error. For logger, I am using log4j now.
What is the best way to implement that function? Can you guys help me?
Upvotes: 0
Views: 153
Reputation: 11587
I am assuming all the three columns are type String. in that case I would solve this using the below snippet. I have created two udf to check for the error records.
isNumber
]isEmpty
]code snippet
import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.udf
val df = rdd.zipWithIndex.map({case ((x,y,z),index) => (index+1,x,y,z)}).toDF("row_num", "c1", "c2", "c3")
val isNumber = udf((x: String) => x.replaceAll("\\d","") == "")
val isEmpty = udf((x: String) => x.trim.length==0)
val errDF = df.filter(isNumber($"c1") || isEmpty($"c2"))
val validDF = df.filter(!(isNumber($"c1") || isEmpty($"c2")))
scala> df.show()
+-------+---+-----+-----+
|row_num| c1| c2| c3|
+-------+---+-----+-----+
| 1|JFK| 1201| 1209|
| 2|LAX| 1208| 1212|
| 3|NYC| | 1209|
| 4| 22| 1201| 1209|
| 5|SFO| 1202| 1209|
+-------+---+-----+-----+
scala> errDF.show()
+-------+---+----+----+
|row_num| c1| c2| c3|
+-------+---+----+----+
| 3|NYC| |1209|
| 4| 22|1201|1209|
+-------+---+----+----+
Upvotes: 1