Reputation: 369
I am doing a Spark project and need advice on how to solve the problem below in the best way:
I have a DataFrame (say MainDF) with millions of records. The format is (name: String, value: Int). Example content:
Davi,130
Joel,20
Emma,500
I have another small file, with 4 records, in the format (className: String, minValue: Int, maxValue: Int). Its content is:
First,500,9999999
Second,100,499
Third,0,99
Unknown,-99999,0
For each value in MainDF, I need to look up this small file and append the class name whose min/max range contains the value. Example output:
Davi,130,Second
Joel,20,Third
Emma,500,First
This is the code I have written:
//Main data read, millions of records
val MainData = sc.textFile("/mainfile.csv")
case class MainType(name: String, value: Int)
val MainDF = MainData.map(line => line.split(",")).map(e => MainType(e(0), e(1).toInt)).toDF
MainDF.registerTempTable("MainTable")

//ref data, just 4 records
case class RefDataType(className: String, minValue: Int, maxValue: Int)
val refData = sc.textFile("/refdata.csv")
val refRDD = refData.map(line => line.split(",")).map(e => RefDataType(e(0), e(1).toInt, e(2).toInt))
//sc.broadcast takes a plain value, not an RDD, so collect the small RDD first
val refBroadcast = sc.broadcast(refRDD.collect())
I think I have to write a UDF here, but I don't know how to use a DataFrame inside a UDF. Is there any other way to do this in Spark with Scala?
Upvotes: 0
Views: 282
Reputation: 3055
The easiest way here is to read both files using the csv datasource and join them using standard Spark SQL, like this:
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import spark.implicits._ // for the $"..." column syntax

val mainSchema = StructType(Seq(
  StructField("name", StringType, false),
  StructField("value", IntegerType, false)))
val mainDf = spark.read.schema(mainSchema).csv("/tmp/b.txt")

val lookupSchema = StructType(Seq(
  StructField("class_name", StringType, false),
  StructField("min_value", IntegerType, false),
  StructField("max_value", IntegerType, false)))
val lookupDf = spark.read.schema(lookupSchema).csv("/tmp/a.txt")

// Both bounds inclusive, so that e.g. value 500 still matches the (First, 500, 9999999) row
val result = mainDf.join(lookupDf, $"value" >= $"min_value" && $"value" <= $"max_value")
result.show()
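Since the question already registers a temp table, the same range join can also be written as a SQL string. The view names below are my own, not from the original post:

mainDf.createOrReplaceTempView("main")
lookupDf.createOrReplaceTempView("lookup")
spark.sql(
  """SELECT m.name, m.value, l.class_name
    |FROM main m
    |JOIN lookup l ON m.value >= l.min_value AND m.value <= l.max_value""".stripMargin
).show()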
I am not sure whether this is the most performant way or whether the one suggested by @philantrovert is (it might also depend on the Spark version you are using). You should try both of them and decide for yourself.
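If you want to see which join strategy Spark actually picks (for a non-equi join with a small side it is typically a BroadcastNestedLoopJoin), you can inspect the physical plan:

// Print the physical plan for the range join above
result.explain()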
Upvotes: 1
Reputation: 10082
You can read the file as an RDD using textFile and collect it, since it's very small (and maybe broadcast it, depending on your requirements). Once you have the Array from collecting the RDD, you can build a Range from each record and a UDF that checks whether your value is in that range.
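Reading the lookup file itself would look something like this sketch (assuming the /refdata.csv path from the question); the demo below hard-codes the same records with parallelize so it is self-contained:

// Hypothetical file-based variant of the lookup data
val refRdd = sc.textFile("/refdata.csv")
  .map(_.split(","))
  .map(e => (e(0), e(1).toInt, e(2).toInt))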
import org.apache.spark.sql.functions.udf

val rdd = sc.parallelize(Array(
  ("First",   500, 9999999),
  ("Second",  100, 499),
  ("Third",     0, 99),
  ("Unknown", -99999, 0)
))

// First element is the class name, second the Range(min, max).
// Range.inclusive keeps the upper bound; a plain Range(min, max)
// would exclude it, so e.g. 499 would never match Second.
val dataArr = rdd.map { case (className, min, max) =>
  (className, Range.inclusive(min, max))
}.collect
// sc.broadcast(dataArr) here

// Return the first class whose range contains x, or null if none matches
val getClassName = udf { (x: Int) =>
  dataArr.collectFirst {
    case (className, range) if range.contains(x) => className
  }.orNull
}

df.withColumn("ClassName", getClassName($"VALUE")).show
+----+-----+---------+
|NAME|VALUE|ClassName|
+----+-----+---------+
|Davi| 130| Second|
|Joel| 20| Third|
|Emma| 500| First|
+----+-----+---------+
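The commented-out broadcast step would look like the sketch below; bcData is a name I am introducing here, and the UDF then reads bcData.value instead of capturing dataArr directly:

// Ship one copy of the lookup array to each executor
val bcData = sc.broadcast(dataArr)
val getClassNameBc = udf { (x: Int) =>
  bcData.value.collectFirst {
    case (className, range) if range.contains(x) => className
  }.orNull
}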
There might well be better solutions available.
Upvotes: 1