I am working on a project that involves reading log records whose fields are base64-encoded protobufs from AWS S3 into ScalaPB-generated classes, on Spark 3.3.1.
One of the proto messages (which is used in a production system and is not easy to change) has field names that cause deserialization errors:
syntax = "proto3";
import "google/protobuf/wrappers.proto";
package my.package;
option java_multiple_files = true;
message MyMap {
  map<string, bool> bool = 5;
  map<string, double> double = 6;
  ...
}
When I leave these fields as-is, I get this error when calling the parseFrom method:
java.lang.UnsupportedOperationException: `double` is not a valid identifier of Java and cannot be used as field name
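For context, each record is decoded and parsed roughly like this (a simplified sketch; parseRecord and the S3 plumbing around it are illustrative):

import java.util.Base64

// Each log field carries a base64-encoded serialized proto.
def parseRecord(base64Payload: String): MyMap = {
  val bytes = Base64.getDecoder.decode(base64Payload)
  MyMap.parseFrom(bytes) // this is the call that blows up with the error above
}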
So I found the Package-scoped options customization in the ScalaPB docs: https://scalapb.github.io/docs/customizations#package-scoped-options
I added a package.proto as follows:
syntax = "proto3";
package my.package;
import "scalapb/scalapb.proto";
option (scalapb.options) = {
  scope: PACKAGE
  preserve_unknown_fields: false // Using this because SparkSql-ScalaPB doesn't like to encode "UnknownFieldSet".
  aux_field_options: [
    ...
    {
      target: "my.package.MyMap.double"
      options: {
        scala_name: "doubleValue"
      }
    },
    ...
  ]
};
I verified that the Scala code is generated properly, and I am able to instantiate an object with the "doubleValue" field:
final case class MyMap(
  ...
  doubleValue: _root_.scala.collection.immutable.Map[_root_.scala.Predef.String, _root_.scala.Double] = _root_.scala.collection.immutable.Map.empty,
  ...
)
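For example, this works fine in plain Scala, outside of Spark:

val m = MyMap(doubleValue = Map("a" -> 1.0))
println(m.doubleValue) // Map(a -> 1.0)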
But when running in a Spark environment (AWS EMR, Spark 3.3.1, Zeppelin notebook), I get this issue:
import scalapb.spark.Implicits._
import scalapb.spark.ProtoSQL
val example = MyClass() // this has a field with type MyMap.
val df = ProtoSQL.createDataFrame(spark, Seq(example))
df.as[MyClass].printSchema
org.apache.spark.sql.AnalysisException: No such struct field boolValue in listBool, listDouble, listString, listLong, bool, double, string, long, float, integer, listFloat, listInteger
So I tried using the "typedEncoderToEncoder":
df.as[MyClass](typedEncoderToEncoder[MyClass]).printSchema
root
...
 |-- id: string (nullable = true)
 |-- myMap: struct (nullable = true)
 |    |-- bool: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: boolean (valueContainsNull = true)
 |    |-- double: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: double (valueContainsNull = true)
...
But I expect it to have the renamed fields "boolValue" and "doubleValue", i.e. something like:
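 |-- myMap: struct (nullable = true)
 |    |-- boolValue: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: boolean (valueContainsNull = true)
 |    |-- doubleValue: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: double (valueContainsNull = true)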
This bug breaks the round trip of serialization and deserialization using the ScalaPB-generated case classes.
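Concretely, the round trip that needs to work is along these lines (the output path is illustrative):

// Write side: encode the ScalaPB case classes into a DataFrame and persist.
val out = ProtoSQL.createDataFrame(spark, Seq(example))
out.write.parquet("s3://bucket/logs-out")

// Read side: mapping back to the case class trips over the renamed map fields.
val back = spark.read.parquet("s3://bucket/logs-out").as[MyClass]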
I am pretty confident this isn't a version compatibility issue, but here are the versions I am using (the build tool is a corporate wrapper over Ant):
com.thesamet.scalapb:sparksql33-scalapb0_11:1.0.4
com.thesamet.scalapb:scalapb-runtime:0.11.15
org.typelevel:frameless-core:0.15.0
org.typelevel:frameless-dataset:0.15.0
org.scala-lang.modules:scala-collection-compat:2.11.0
com.chuusai:shapeless:2.3.10
protobuf 26.1
I am shading com.google.protobuf.** and scala.collection.compat.**, but I wasn't able to shade shapeless as suggested in the ScalaPB docs, since that pulls in a chain of dependencies (frameless, scalapb, etc.).
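For reference, what I am shading is equivalent to these sbt-assembly rules (just a sketch of intent, since my actual build tool is different):

assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> "shadeproto.@1").inAll,
  ShadeRule.rename("scala.collection.compat.**" -> "scalacompat.@1").inAll
  // The docs also suggest shading shapeless, which I couldn't make work:
  // ShadeRule.rename("shapeless.**" -> "shadeshapeless.@1").inAll
)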
Any thoughts?