Reputation: 425
I read in a Parquet file from S3 in Databricks using the following command:
df = sqlContext.read.parquet('s3://path/to/parquet/file')
I want to read the schema of the dataframe, which I can do using the following command:
df_schema = df.schema.json()
But I am not able to write the df_schema object to a file on S3.
Note: I am open to not creating a json file. I just want to save the schema of the dataframe to any file type (possibly a text file) in AWS S3.
I have tried writing the JSON schema as follows:
df_schema.write.csv("s3://path/to/file")
or
df_schema.write.format('json').save('s3://path/to/file')
Both of them give me the following error:
AttributeError: 'str' object has no attribute 'write'
Upvotes: 6
Views: 8984
Reputation: 31490
df.schema.json() returns a string object, and string objects don't have a .write method.
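A quick check (a minimal sketch, using the df from the question) confirms the type:
# .schema.json() serializes the schema to a plain Python str
type(df.schema.json())
# <class 'str'>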
In the RDD API:
df_schema = df.schema.json()
Parallelize the df_schema variable to create an RDD, then use the .saveAsTextFile method to write the schema to S3.
sc.parallelize([df_schema]).saveAsTextFile("s3://path/to/file")
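To read the schema back later, a minimal sketch (same placeholder path as above):
# saveAsTextFile writes part files; first() returns the single schema line
saved_schema = sc.textFile("s3://path/to/file").first()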
(or)
In the DataFrame API:
from pyspark.sql import Row
df_schema = df.schema.json()
df_sch=sc.parallelize([Row(schema=df_schema)]).toDF()
df_sch.write.csv("s3://path/to/file")
df_sch.write.text("s3://path/to/file")  # write as a text file
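To read it back through the DataFrame API, a minimal sketch (assuming the text output path above):
# read.text yields a single string column named "value"
saved_schema = spark.read.text("s3://path/to/file").first()[0]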
Upvotes: 2
Reputation: 2767
Here is a working example of saving a schema and applying it to new csv data:
# funcs
from pyspark.sql.functions import *
from pyspark.sql.types import *
# example old df schema w/ long datatype
df = spark.range(10)
df.printSchema()
df.write.mode("overwrite").csv("old_schema")
root
|-- id: long (nullable = false)
# example new df schema we will save w/ int datatype
df = df.select(col("id").cast("int"))
df.printSchema()
root
|-- id: integer (nullable = false)
# get schema as a JSON string
schema = df.schema.json()
# write/read schema as .txt
# note: Python's built-in open() can't target s3:// URIs directly;
# on Databricks you can write through a DBFS mount instead
# (assumption: /dbfs/mnt/... is a mounted path)
import json
with open('/dbfs/mnt/path/to/schema.txt', 'w') as F:
    json.dump(schema, F)
with open('/dbfs/mnt/path/to/schema.txt', 'r') as F:
    saved_schema = json.load(F)
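If you want to write to S3 itself rather than through a mount, one option is boto3 (a sketch; the bucket and key names are hypothetical placeholders):
import boto3
s3 = boto3.client('s3')
# upload the schema string as a small text object (hypothetical bucket/key)
s3.put_object(Bucket='my-bucket', Key='path/to/schema.txt', Body=schema)
# read it back as a plain string
saved_schema = s3.get_object(Bucket='my-bucket', Key='path/to/schema.txt')['Body'].read().decode('utf-8')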
# saved schema
saved_schema
'{"fields":[{"metadata":{},"name":"id","nullable":false,"type":"integer"}],"type":"struct"}'
# construct saved schema object
new_schema = StructType.fromJson(json.loads(saved_schema))
new_schema
StructType(List(StructField(id,IntegerType,false)))
# use saved schema to read csv files ... new df has int datatype and not long
new_df = spark.read.csv("old_schema", schema=new_schema)
new_df.printSchema()
root
|-- id: integer (nullable = true)
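As an optional sanity check (a minimal sketch): the reconstructed schema equals the StructType we serialized, while new_df.schema differs only in nullability, since Spark marks columns read from CSV as nullable:
# the round-tripped schema matches the original object
assert new_schema == df.schema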
Upvotes: 4