Reputation: 4871
I need to parse a JSON schema file to create a pyspark.sql.types.StructType. I have found a Scala library which can do this for me, so I'm calling it like this:
with open('path/to/schema.json') as f:
    js = f.read()
conv = spark.sparkContext._jvm.org.zalando.spark.jsonschema.SchemaConverter
schema = conv.convertContent(js)
But when I try to use it to build a DataFrame like this:
spark.read.format("json").schema(schema)
I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/sql/readwriter.py", line 103, in schema
raise TypeError("schema should be StructType")
TypeError: schema should be StructType
If I print the type:
print(type(schema))
I get:
<class 'py4j.java_gateway.JavaObject'>
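Looking at the traceback, the check in readwriter.py that fails appears to be a plain isinstance test, roughly this (paraphrased from the Spark 2.1 source, not a verbatim quote):

if not isinstance(schema, StructType):
    raise TypeError("schema should be StructType")

so the py4j JavaObject returned by the Scala converter can never pass it.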
How do I wrap the value as a Python StructType?
Upvotes: 3
Views: 1807
Reputation: 4871
After digging around in the pyspark source, I looked at the implementation for DataFrame.schema:
@property
@since(1.3)
def schema(self):
    if self._schema is None:
        try:
            self._schema = _parse_datatype_json_string(self._jdf.schema().json())
        except AttributeError as e:
            raise Exception(
                "Unable to parse datatype from schema. %s" % e)
    return self._schema
The method _parse_datatype_json_string is defined in pyspark.sql.types, so this works:
from pyspark.sql.types import _parse_datatype_json_string

conv = spark.sparkContext._jvm.org.zalando.spark.jsonschema.SchemaConverter
jschema = conv.convertContent(js)
# jschema is a py4j proxy for the Java-side StructType; serialize it to
# JSON and parse that back into a Python StructType
schema = _parse_datatype_json_string(jschema.json())
reader = spark.read.format("json").schema(schema)
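For completeness, here is the whole round trip as a minimal self-contained sketch; the file paths are placeholders, and it assumes the org.zalando spark-json-schema JAR is already on the driver's classpath:

from pyspark.sql import SparkSession
from pyspark.sql.types import _parse_datatype_json_string

spark = SparkSession.builder.getOrCreate()

with open('path/to/schema.json') as f:  # placeholder path
    js = f.read()

# Build the schema on the JVM side, then round-trip it through its JSON
# representation so pyspark can reconstruct it as a Python StructType.
conv = spark.sparkContext._jvm.org.zalando.spark.jsonschema.SchemaConverter
jschema = conv.convertContent(js)
schema = _parse_datatype_json_string(jschema.json())

df = spark.read.format("json").schema(schema).load('path/to/data.json')  # placeholder path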
Now when I call:
print(type(schema))
I get:
<class 'pyspark.sql.types.StructType'>
Upvotes: 4