MatanKri

Reputation: 283

Issue: sparksql33-scalapb0_11 ignores the aux_field_options scala_name mapping

I am working on a project that reads log records from AWS S3, whose fields are base64-encoded protobufs, into ScalaPB-generated classes in Spark 3.3.1.

One of the proto messages (which is used in a production system and not easy to change) has field names that cause deserialization errors:


syntax = "proto3";

import "google/protobuf/wrappers.proto";

package my.package;

option java_multiple_files = true;

message MyMap {
    map<string, bool> bool = 5;
    map<string, double> double = 6;
    ...
}

When I leave these fields as-is, I get this error when calling the parseFrom method:

java.lang.UnsupportedOperationException: `double` is not a valid identifier of Java and cannot be used as field name
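For context, the decoding step in isolation looks roughly like this (a minimal sketch; the S3/Spark plumbing is omitted, and the `parseFrom` call on the generated class is left as a comment since it requires the generated code):

```scala
import java.util.Base64
import java.nio.charset.StandardCharsets

object DecodeStep {
  // Decode a base64-encoded log field back to raw protobuf bytes.
  def decodePayload(b64: String): Array[Byte] =
    Base64.getDecoder.decode(b64)

  def main(args: Array[String]): Unit = {
    // Round-trip a sample payload to show the decode step on its own.
    val raw     = "sample-bytes".getBytes(StandardCharsets.UTF_8)
    val encoded = Base64.getEncoder.encodeToString(raw)
    val decoded = decodePayload(encoded)
    println(decoded.sameElements(raw)) // prints true
    // With the generated class on the classpath, the next step would be:
    // val msg = my.`package`.MyMap.parseFrom(decoded)
  }
}
```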

So I found the package-scoped options customization in the ScalaPB docs: https://scalapb.github.io/docs/customizations#package-scoped-options

I added a package.proto as follows:

syntax = "proto3";

package my.package;

import "scalapb/scalapb.proto";

option (scalapb.options) = {
  scope: PACKAGE
  preserve_unknown_fields: false // Using this because SparkSql-ScalaPB doesn't like to encode "UnknownFieldSet".
  aux_field_options: [
...
    {
      target: "my.package.MyMap.double"
      options: {
        scala_name: "doubleValue"
      }
    },
...
  ]
};

I verified that the Scala code is generated properly and I am able to instantiate an object with the "doubleValue" field.

final case class MyMap(
...
    doubleValue: _root_.scala.collection.immutable.Map[_root_.scala.Predef.String, _root_.scala.Double] = _root_.scala.collection.immutable.Map.empty,
...
)
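As a sanity check, the renamed field behaves like any other case-class member (the snippet below uses a simplified stand-in, not the actual ScalaPB output, which also carries the GeneratedMessage machinery):

```scala
// Simplified stand-in for the ScalaPB-generated case class;
// only the renamed map field is reproduced here.
final case class MyMapStub(
    doubleValue: Map[String, Double] = Map.empty
)

object StubDemo {
  def main(args: Array[String]): Unit = {
    // Instantiate via the renamed field, as the generated code allows.
    val m = MyMapStub(doubleValue = Map("latency" -> 1.5))
    println(m.doubleValue("latency")) // prints 1.5
  }
}
```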

But when running in a Spark environment (AWS EMR, Spark 3.3.1, Zeppelin notebook), I get this issue:

import scalapb.spark.Implicits._
import scalapb.spark.ProtoSQL

val example = MyClass() // this has a field with type MyMap.
val df = ProtoSQL.createDataFrame(spark, Seq(example))

df.as[MyClass].printSchema

org.apache.spark.sql.AnalysisException: No such struct field boolValue in listBool, listDouble, listString, listLong, bool, double, string, long, float, integer, listFloat, listInteger

So I tried using the "typedEncoderToEncoder":

df.as[MyClass](typedEncoderToEncoder[MyClass]).printSchema


root
...
 |-- id: string (nullable = true)
 |-- myMap: struct (nullable = true)
 |    |-- bool: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: boolean (valueContainsNull = true)
 |    |-- double: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: double (valueContainsNull = true)
...

But I expected it to have the renamed fields "boolValue" and "doubleValue".

This bug breaks the serialization/deserialization round trip through the ScalaPB-generated case classes.

I am pretty confident this isn't a version compatibility issue, but here are the versions I am using (the build tool is a corporate wrapper over Ant):

com.thesamet.scalapb:sparksql33-scalapb0_11:1.0.4
com.thesamet.scalapb:scalapb-runtime:0.11.15
org.typelevel:frameless-core:0.15.0
org.typelevel:frameless-dataset:0.15.0
org.scala-lang.modules:scala-collection-compat:2.11.0
com.chuusai:shapeless:2.3.10
protobuf-26.1

I am shading com.google.protobuf.** and scala.collection.compat.**, but I wasn't able to shade shapeless as the ScalaPB docs suggest, since doing so pulls in a chain of dependencies (frameless, scalapb, etc.).

Any thoughts?

Upvotes: 0

Views: 35

Answers (0)
