Reputation: 617
I have below 2 clarifications on AWS Glue, could you please clarify. Because I need to use glue as part of my project.
I would like to load a csv/txt file into a Glue job to process it. (Like we do in Spark with dataframes). Is this possible in Glue? Or do we have to use only Crawlers to crawl the data into Glue tables and make use of them like below for further processing?
empdf = glueContext.create_dynamic_frame.from_catalog(
database="emp",
table_name="emp_json")
Below I used Spark code to load a file into Glue, but I'm getting lengthy error logs. Can we directly run Spark or PySpark code as it is without any changes in Glue?
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
dfnew = spark.read.option("header","true").option("delimiter", ",").csv("C:\inputs\TEST.txt")
dfnew.show(2)
Upvotes: 5
Views: 28168
Reputation: 4750
It's possible to load data directly from s3 using Glue:
sourceDyf = glueContext.create_dynamic_frame_from_options(
connection_type="s3",
format="csv",
connection_options={
"paths": ["s3://bucket/folder"]
},
format_options={
"withHeader": True,
"separator": ","
})
You can also do that just with spark (as you already tried):
sourceDf = spark.read
.option("header","true")
.option("delimiter", ",")
.csv("C:\inputs\TEST.txt")
However, in this case Glue doesn't guarantee that they provide appropriate Spark readers. So if your error is related to missing data source for CSV then you should add spark-csv lib to the Glue job by providing s3 path to its locations via --extra-jars parameter.
Upvotes: 11
Reputation: 617
Below 2 cases i tested working fine:
dfnew = glueContext.create_dynamic_frame_from_options("s3", {'paths': ["s3://MyBucket/path/"] }, format="csv" )
dfnew.show(2)
DynFr = glueContext.create_dynamic_frame.from_catalog(database="test_db", table_name="test_table")
df1 = DynFr.toDF()
Upvotes: 5