M80

Reputation: 994

Avoid conversion in double in Spark SQL with schema

I have a simple JSON as below; the value node will sometimes contain a STRING and sometimes a DOUBLE. I want to treat the value as a STRING as it comes in. But when Spark sees that the tag is a double, it converts the value to a different format with an E (scientific notation).

Input JSON

{"key" : "k1", "value": "86093351508521808.0"}
{"key" : "k2", "value": 86093351508521808.0}

Spark output CSV

k1,86093351508521808.0
k2,8.6093351508521808E16

Expected output

k1,86093351508521808.0
k2,86093351508521808.0

Please advise on how the desired output can be achieved. We never read the value in the tag, so we will never be aware of its precision and other details.

Below is the sample code

public static void main(String[] args) {
    SparkSession sparkSession = SparkSession
        .builder()
        .appName(TestSpark.class.getName())
        .master("local[*]").getOrCreate();

    SparkContext context = sparkSession.sparkContext();
    context.setLogLevel("ERROR");
    SQLContext sqlCtx = sparkSession.sqlContext();
    System.out.println("Spark context established");

    // Force both columns to StringType so "value" is not parsed as a double
    List<StructField> kvFields = new ArrayList<>();
    kvFields.add(DataTypes.createStructField("key", DataTypes.StringType, true));
    kvFields.add(DataTypes.createStructField("value", DataTypes.StringType, true));
    StructType employeeSchema = DataTypes.createStructType(kvFields);

    // Read the JSON with the explicit schema (no inference)
    Dataset<Row> dataset = sparkSession.read()
        .option("inferSchema", false)
        .format("json")
        .schema(employeeSchema)
        .load("D:\\dev\\workspace\\java\\simple-kafka\\key_value.json");
    dataset.createOrReplaceTempView("sourceView");

    // Write the rows back out as CSV
    sqlCtx.sql("select * from sourceView")
        .write()
        .format("csv")
        .save("D:\\dev\\workspace\\java\\simple-kafka\\output\\" + UUID.randomUUID().toString());

    sparkSession.close();

}

Upvotes: 0

Views: 1212

Answers (1)

John

Reputation: 423

We can cast that column to DecimalType as follows:

scala> import org.apache.spark.sql.types.DecimalType;
import org.apache.spark.sql.types.DecimalType

scala> spark.read.json(sc.parallelize(Seq("""{"key" : "k1", "value": "86093351508521808.0"}""","""{"key" : "k2", "value": 86093351508521808.0}"""))).select(col("value").cast(DecimalType(28, 1))).show

+-------------------+
|              value|
+-------------------+
|86093351508521808.0|
|86093351508521808.0|
+-------------------+
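The same cast can be applied directly in the Java pipeline from the question. Below is a minimal sketch (assuming the Spark 2.x Java API and reusing the file paths from the question; DecimalType(28, 1) is an arbitrary precision/scale chosen to be wide enough for the sample values):

import static org.apache.spark.sql.functions.col;

import java.util.UUID;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;

public class DecimalCastExample {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession
            .builder()
            .appName(DecimalCastExample.class.getName())
            .master("local[*]")
            .getOrCreate();

        // Let Spark infer the JSON schema, then cast "value" to a decimal wide
        // enough to hold every digit, so the CSV writer prints plain notation
        // instead of 8.6093351508521808E16.
        Dataset<Row> dataset = sparkSession.read()
            .format("json")
            .load("D:\\dev\\workspace\\java\\simple-kafka\\key_value.json")
            .withColumn("value", col("value").cast(DataTypes.createDecimalType(28, 1)));

        dataset.write()
            .format("csv")
            .save("D:\\dev\\workspace\\java\\simple-kafka\\output\\" + UUID.randomUUID().toString());

        sparkSession.close();
    }
}

Because DecimalType is a fixed-point type, the output stays as 86093351508521808.0 for both rows instead of falling back to scientific notation; widen the precision and scale if the real data carries more digits.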

Upvotes: 1
