RichaDwivedi

Reputation: 343

NullPointerException while converting a DataFrame to a list inside a UDF

I am reading two different .csv files, each of which has only one column, as below:

    val dF1 = sqlContext.read.csv("some.csv").select($"ID")
    val dF2 = sqlContext.read.csv("other.csv").select($"PID")

I am trying to check whether each dF2("PID") value exists in dF1("ID"):

    val getIdUdf = udf((x:String)=>{dF1.collect().map(_(0)).toList.contains(x)})
    val dfFinal = dF2.withColumn("hasId", getIdUdf($"PID"))

This gives me a NullPointerException. But if I convert dF1 to a list outside the UDF and use that list inside it, it works:

    val dF1 = sqlContext.read.csv("some.csv").select($"ID").collect().map(_(0)).toList
    val getIdUdf = udf((x:String)=>{dF1.contains(x)})
    val dfFinal = dF2.withColumn("hasId", getIdUdf($"PID"))

I know I can use a join to get this done, but I want to know the reason for the NullPointerException here.

Thanks.

Upvotes: 1

Views: 283

Answers (1)

efan

Reputation: 968

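Please check this question about accessing a DataFrame inside the transformation of another DataFrame. That is exactly what your UDF does, and it is not possible in Spark: the UDF body runs on the executors, while a DataFrame can only be used from the driver, so the dF1 reference captured in the closure is not usable there and calling collect() on it fails with the NullPointerException. Your second version works because collect() runs on the driver and the UDF captures only a plain Scala list. The solution is either to use a join, or to collect outside the transformation and broadcast the result.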
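For illustration, the two options named above (collect outside the transformation and broadcast, or use a join) might look roughly like the sketch below. It is not taken from the linked question. The object name HasIdSketch and the method names withBroadcast and withJoin are hypothetical, and it assumes Spark 2.x or later, that dF1 has a string column ID, and that dF2 has a string column PID, as in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper object; names are for illustration only.
object HasIdSketch {

  // Option 1: collect the lookup column on the driver, broadcast it,
  // and let the UDF touch only the broadcast value (never a DataFrame).
  def withBroadcast(spark: SparkSession, dF1: DataFrame, dF2: DataFrame): DataFrame = {
    // Assumes the ID column holds strings (the default for read.csv without a schema).
    val ids: Set[String] = dF1.collect().map(_.getString(0)).toSet
    val idsBc = spark.sparkContext.broadcast(ids)

    // The closure captures idsBc, which is serializable and readable on executors.
    val hasId = udf((x: String) => x != null && idsBc.value.contains(x))
    dF2.withColumn("hasId", hasId(col("PID")))
  }

  // Option 2: no UDF at all; a left join marks the rows that found a match.
  def withJoin(dF1: DataFrame, dF2: DataFrame): DataFrame = {
    // distinct() keeps duplicate IDs from multiplying the rows of dF2.
    val ids = dF1.select(col("ID")).distinct()
    dF2.join(ids, dF2("PID") === ids("ID"), "left")
      .withColumn("hasId", ids("ID").isNotNull)
      .drop(ids("ID"))
  }
}
```

The broadcast variant only makes sense while the collected IDs fit comfortably in memory on the driver and on each executor; for a large dF1 the join is likely the safer choice.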

Upvotes: 3
