VermaC

Reputation: 53

How to validate a large CSV file column-wise or row-wise in a Spark DataFrame

I have a large data file (10 GB or more) with 150 columns, and I need to validate every value (datatype/format/null/domain value/primary key, etc.) against a different rule per column, and finally create two output files: one with the successful rows and another with the error rows plus error details. A row should be moved to the error file as soon as any of its columns fails validation; there is no need to validate it further.

I am reading the file into a Spark DataFrame. Should I validate it column-wise or row-wise, and which way gives the best performance?

Upvotes: 0

Views: 662

Answers (1)

kavetiraviteja

Reputation: 2208

To answer your question:

I am reading the file into a Spark DataFrame. Should I validate it column-wise or row-wise, and which way gives the best performance?

A DataFrame is a distributed collection of data organized as a set of rows spread across the cluster, and most of the transformations defined in Spark are applied row by row, operating on Row objects.

Pseudo code

    import spark.implicits._

    val input  = spark.read.csv(inputFile)   // Dataset[Row]; all columns read as strings here
    val schema = input.schema

    // validate each row and tag it with the first error found (if any)
    val validated = input.map { row =>
      var error: String = null
      schema.fields.foreach { f =>
        if (error == null) {                           // stop checking after the first failure
          // f.dataType  // field type -> apply datatype/format checks for this type
          // f.name      // column name
          val fieldValue = row.getAs[Any](f.name)      // field value to check
          if (fieldValue == null) error = s"${f.name} is null"   // example rule: null check
          // add further domain / primary-key checks here in the same way
        }
      }
      (row.mkString(","), if (error == null) "" else error, error == null)
    }

    // the writes must happen on the Dataset, not inside the map
    validated.filter(_._3).map(_._1).write.text(correctLoc)
    validated.filter(x => !x._3).map(x => s"${x._1},${x._2}").write.text(errorLoc)
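
If the rules can be expressed as column expressions, a column-wise variant is also possible (this is not from the original answer, just a minimal sketch): build an "error" column with chained when/otherwise expressions so Catalyst can optimize the checks, then split the DataFrame on that column. The column names "id" and "age" below are hypothetical placeholders; `input`, `correctLoc` and `errorLoc` are reused from the sketch above.

    import org.apache.spark.sql.functions._

    // hypothetical columns "id" and "age"; the first matching when() supplies
    // the error message, which mirrors the "stop at the first error" requirement
    val withError = input.withColumn("error",
      when(col("id").isNull, "id is null")
        .when(col("age").cast("int").isNull, "age is not a valid int")
        .otherwise(""))

    withError.filter(col("error") === "").drop("error").write.csv(correctLoc)
    withError.filter(col("error") =!= "").write.csv(errorLoc)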

Upvotes: 1
