Reputation: 2207
I want to read a Parquet file using Spark SQL in which one column has a mixed data type (string and integer).
val sqlContext = new SQLContext(sparkContext)
val df = sqlContext.read.parquet("/tmp/data")
This throws an exception: Failed to merge incompatible data types IntegerType and StringType
Is there a way to explicitly cast the column during the read?
Upvotes: 2
Views: 5539
Reputation: 61
The only way I have found is to manually cast one of the fields so that the schemas match. You can do this by reading the individual Parquet files into a sequence and iteratively modifying them like so (a usage sketch follows the explanation below):
import scala.util.{Failure, Success, Try}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

def unionReduce(dfs: Seq[DataFrame]): DataFrame = {
  dfs.reduce { (x, y) =>
    // Pair each column name with its data type for set comparison.
    def schemaTruncate(df: DataFrame) = df.schema.map(field => field.name -> field.dataType)
    // Columns (by name and type) present in y but not in x.
    val diff = schemaTruncate(y).toSet.diff(schemaTruncate(x).toSet)
    val fixedX = diff.foldLeft(x) { case (df, (name, dataType)) =>
      // Try casting an existing column to the new type; if the column
      // does not exist, add it as a null literal of that type instead.
      Try(df.withColumn(name, col(name).cast(dataType))) match {
        case Success(newDf) => newDf
        case Failure(_)     => df.withColumn(name, lit(null).cast(dataType))
      }
    }
    // Align column order with y, since unionAll matches columns by position.
    fixedX.select(y.columns.map(col): _*).unionAll(y)
  }
}
The above function first finds the columns, differing by name or type, that are in Y but not in X. It then adds those columns to X by attempting to cast the existing columns and, on failure, adding the column as a literal null. Finally, it selects only Y's columns from the fixed X, in case X has columns that Y does not, and returns the union.
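For reference, here is a minimal usage sketch; the part-file paths are hypothetical placeholders, so substitute the actual files under your directory. Each file is read into its own DataFrame, so their schemas are allowed to differ, and unionReduce then folds them into one DataFrame with a unified schema:
// Hypothetical part-file paths under /tmp/data; replace with your own.
val paths = Seq("/tmp/data/part-00000.parquet", "/tmp/data/part-00001.parquet")
// One DataFrame per file, so mixed column types never meet at read time.
val dfs = paths.map(path => sqlContext.read.parquet(path))
// Merge into a single DataFrame, casting or null-padding as needed.
val merged = unionReduce(dfs)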
Upvotes: 2