Phagun Baya

Reputation: 2207

Read parquet file having mixed data type in a column

I want to read a parquet file using spark sql in which one column has mixed datatype (string and integer).

val sqlContext = new SQLContext(sparkContext)
val df = sqlContext.read.parquet("/tmp/data")

This throws an exception: `Failed to merge incompatible data types IntegerType and StringType`

Is there a way to explicitly cast the column during the read?
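For context, `DataFrameReader.schema` does let you supply a schema up front (the column name `value` below is illustrative, not from the actual data), but the schema is applied rather than cast, so files whose physical type disagrees can seemingly still fail to read:

```scala
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Supplying an explicit schema avoids schema merging at read time.
val schema = StructType(Seq(StructField("value", StringType)))
val df = sqlContext.read.schema(schema).parquet("/tmp/data")
// Files whose physical Parquet type is int32 may still fail,
// because the supplied schema is not used to cast the stored values.
```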

Upvotes: 2

Views: 5539

Answers (1)

dkmonet

Reputation: 61

The only way that I have found is to manually cast one of the fields so that they match. You can do this by reading the individual Parquet files into a sequence and merging them iteratively:

import scala.util.{Try, Success, Failure}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

def unionReduce(dfs: Seq[DataFrame]): DataFrame = {
  dfs.reduce { (x, y) =>
    // Pair each column name with its type so the two schemas can be diffed.
    def schemaTruncate(df: DataFrame) = df.schema.map(field => field.name -> field.dataType)
    // (name, type) pairs present in y but not in x.
    val diff = schemaTruncate(y).toSet.diff(schemaTruncate(x).toSet)
    val fixedX = diff.foldLeft(x) { case (df, (name, dataType)) =>
      // Try to cast an existing column to y's type; if the column
      // does not exist in x, add it as a typed null instead.
      Try(df.withColumn(name, col(name).cast(dataType))) match {
        case Success(newDf) => newDf
        case Failure(_)     => df.withColumn(name, lit(null).cast(dataType))
      }
    }
    // Align column order with y before unioning.
    fixedX.select(y.columns.map(col): _*).unionAll(y)
  }
}

The above function first finds the columns which are in Y but not in X, either by name or by type. It then adds those columns to X, attempting to cast the existing column and, on failure, adding it as a typed null literal. Finally, it selects only Y's columns from the fixed X (in case X has columns that Y lacks) and returns the union.
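A usage sketch under the asker's setup; the individual part-file names under `/tmp/data` are assumed for illustration:

```scala
// Read each Parquet part file separately so Spark never has to merge
// the conflicting schemas, then reconcile the frames with unionReduce.
val paths = Seq(
  "/tmp/data/part-00000.parquet",  // assumed file name
  "/tmp/data/part-00001.parquet"   // assumed file name
)
val dfs = paths.map(p => sqlContext.read.parquet(p))
val merged = unionReduce(dfs)
```

Each per-file read has a single consistent schema, so the incompatible-types error never occurs; the cast happens afterwards in `unionReduce`.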

Upvotes: 2
