Reputation: 142
I'm having issues reading data with an AWS Glue job in PySpark:
Data is sent from an AWS Kinesis Firehose (sample data) to an S3 bucket, stored as JSON and compressed with snappy-hadoop.
I'm able to read the data into a legacy Spark DataFrame with spark.read.json(), but this won't work with a Glue DynamicFrame (the schema is not parsed at all), whether I use the from_catalog or the from_options method:
Spark Legacy DataFrame
# import from legacy spark read
spark_df = spark.read.json("s3://my-bucket/sample-json-hadoop-snappy/")
spark_df.printSchema()
- result:
root
|-- change: double (nullable = true)
|-- price: double (nullable = true)
|-- sector: string (nullable = true)
|-- ticker_symbol: string (nullable = true)
|-- year: integer (nullable = true)
|-- dt: date (nullable = true)
Glue DynamicFrame
# import from glue options
options_df = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/sample-json-hadoop-snappy/"]},
    format="json"
)
options_df.printSchema()
- result:
root
Upvotes: 4
Views: 2842
Reputation: 5526
You can use the legacy Spark reader in the Glue job as well; if you want to perform operations with the Glue libraries only, read the data with Spark first and then convert the DataFrame into a DynamicFrame:
from awsglue.dynamicframe import DynamicFrame

# Read the snappy-compressed JSON with the Spark reader, which infers the schema
df = spark.read.json("s3://my-bucket/sample-json-hadoop-snappy/")

# Convert the DataFrame into a Glue DynamicFrame so the Glue APIs can be used on it
DynF = DynamicFrame.fromDF(df, glueContext, "df")
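If it helps, here is a minimal sketch of what you might do next: confirm that the converted frame now has the parsed schema and write it back out through the Glue API (the output path is hypothetical):

# The converted DynamicFrame carries the schema inferred by spark.read.json
DynF.printSchema()

# Write the DynamicFrame out with the Glue sink (adjust path/format as needed)
glueContext.write_dynamic_frame.from_options(
    frame=DynF,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="parquet"
)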
Currently, snappy compression is supported only with Parquet files in the Glue libraries.
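For comparison, a minimal sketch of a case that does work with from_options, assuming the same data were delivered as snappy-compressed Parquet (the S3 path is hypothetical):

# Reading snappy-compressed Parquet via from_options parses the schema as expected
parquet_df = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/sample-parquet-snappy/"]},
    format="parquet"
)
parquet_df.printSchema()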
Upvotes: 3