Erik Barajas

Reputation: 193

Getting values of Fields of a Row of DataFrame - Spark Scala

I have a DataFrame which contains several records,

(Screenshot showing that the DataFrame contains data)

I want to iterate over each row of this DataFrame to validate the data in each of its columns, with something like the following code:

val validDF = dfNextRows.map {
    x => ValidateRow(x)
}

def ValidateRow(row: Row): Boolean = {
    val nC = row.getString(0)
    val si = row.getString(1)
    val iD = row.getString(2)
    val iH = row.getString(3)
    val sF = row.getString(4)

    // Stuff to validate the data field of each row
    validateNC(nC)
    validateSI(si)
    validateID(iD)
    validateIF(iH)
    validateSF(sF)
    true
}

But, doing some tests, if I print the value of the val nC (to be sure that I'm sending the correct information to each function), it doesn't print anything:

def ValidateRow(row: Row): Boolean = {
    val nC = row.getString(0)
    val si = row.getString(1)
    val iD = row.getString(2)
    val iH = row.getString(3)
    val sF = row.getString(4)

    println(nC)

    validateNC(nC)
    validateSI(si)
    validateID(iD)
    validateIF(iH)
    validateSF(sF)
    true
}

(Screenshot showing the empty output)

How can I know that I'm sending the correct information to each function (that I'm reading the data of each column of the row correctly)?

Regards.

Upvotes: 1

Views: 7388

Answers (2)

Ramesh Maharjan

Reputation: 41957

Spark DataFrame functions should give you a good start.

If your validation functions are simple enough (like checking for null values), then you can embed them directly as

dfNextRows.withColumn("num_cta", when(col("num_cta").isNotNull, col("num_cta")).otherwise(lit(0)))

You can do the same for the other columns using the appropriate Spark DataFrame functions.
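As a minimal sketch of that null-to-default rule, the logic can be written as a plain Scala function first (the column name `num_cta` and the default of 0 are just the example from above, not requirements):

```scala
// The null-to-zero rule from the snippet above, as a plain Scala function.
def defaultToZero(v: Option[Long]): Long = v.getOrElse(0L)

// With spark-sql on the classpath, the column-wise equivalent would be:
//   dfNextRows.withColumn("num_cta",
//     when(col("num_cta").isNotNull, col("num_cta")).otherwise(lit(0)))
```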

If your validation rules are complex then you can use udf functions as

def validateNC = udf((num_cta: Long) => {
   // define your rules here
})

You can call the udf function using withColumn as

dfNextRows.withColumn("num_cta", validateNC(col("num_cta")))

You can do the same for the rest of your validation rules.
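To make the udf approach concrete, here is one hypothetical rule (non-empty, all digits — an assumption for illustration, not the asker's actual rule) written as a plain function that a udf can wrap:

```scala
// Hypothetical rule: an account number is valid if non-null, non-empty, and all digits.
def isValidNC(numCta: String): Boolean =
  numCta != null && numCta.nonEmpty && numCta.forall(_.isDigit)

// Registered and applied as a udf (requires spark-sql on the classpath):
//   val validateNC = udf(isValidNC _)
//   dfNextRows.withColumn("num_cta_valid", validateNC(col("num_cta")))
```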

I hope to see your problem get resolved soon

Upvotes: 3

0x6C38

Reputation: 7076

map is a transformation; you need to apply an action. For instance, you could do dfNextRows.map(x => ValidateRow(x)).first. Spark operates lazily, much like the Stream class in the standard collections.
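The same deferred behavior can be seen with a lazy view from the standard collections (a close analogue of the Stream class mentioned above), without needing Spark at all:

```scala
// A lazy view defers map until the result is forced, much like Spark
// defers a DataFrame transformation until an action runs.
var evaluated = 0
val mapped = List(1, 2, 3).view.map { x => evaluated += 1; x * 2 }

val countBeforeForcing = evaluated   // still 0: the mapping function has not run
val forced = mapped.toList           // forcing the view (the analogue of a Spark action)
val countAfterForcing = evaluated    // now 3: the function ran once per element
```

This is why the asker's println inside ValidateRow produced no output: nothing runs until an action such as first, collect, or show forces the computation.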

Upvotes: 2
