How to load a csv/txt file into AWS Glue job

Question

I have two questions about AWS Glue, which I need to use as part of my project.

I would like to load a csv/txt file into a Glue job to process it (like we do in Spark with DataFrames). Is this possible in Glue? Or do we have to use Crawlers to crawl the data into Glue tables and then use them as below for further processing?
```
empdf = glueContext.create_dynamic_frame.from_catalog(
    database="emp",
    table_name="emp_json")
```

Below is the Spark code I used to load a file in Glue, but I'm getting lengthy error logs. Can we run Spark or PySpark code directly in Glue as is, without any changes?

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
dfnew = spark.read.option("header","true").option("delimiter", ",").csv("C:\inputs\TEST.txt")
dfnew.show(2)
```

RK. · Accepted Answer

I tested the two cases below and both work fine:

To load a file from S3 into Glue:

```
dfnew = glueContext.create_dynamic_frame_from_options("s3", {'paths': ["s3://MyBucket/path/"]}, format="csv")
dfnew.show(2)
```
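If the file has a header row or a delimiter other than a comma, the CSV reader can be told so through `format_options`. Below is a minimal sketch, assuming it runs inside a Glue job where `awsglue` is available. The helper names `csv_read_kwargs` and `load_csv` are hypothetical and the bucket path is the same placeholder as above; `withHeader` and `separator` are CSV format options from the Glue documentation.

```python
def csv_read_kwargs(s3_path, header=True, delimiter=","):
    """Build keyword arguments for create_dynamic_frame.from_options.

    Hypothetical helper: it only assembles the arguments, so it can be
    checked without a Glue environment.
    """
    return {
        "connection_type": "s3",
        "connection_options": {"paths": [s3_path]},
        "format": "csv",
        "format_options": {"withHeader": header, "separator": delimiter},
    }


def load_csv(glue_context, s3_path, header=True, delimiter=","):
    """Read the CSV/TXT files under s3_path into a DynamicFrame.

    glue_context is assumed to be an awsglue.context.GlueContext.
    """
    return glue_context.create_dynamic_frame.from_options(
        **csv_read_kwargs(s3_path, header, delimiter)
    )


# Usage inside a Glue job (placeholder path):
# dyf = load_csv(glueContext, "s3://MyBucket/path/")
# dyf.show(2)
```

Keeping the argument building separate from the Glue call is just a convenience for this sketch; passing the same dictionaries inline works equally well.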

To load data from Glue databases and tables that were already generated by Glue Crawlers:

```
DynFr = glueContext.create_dynamic_frame.from_catalog(database="test_db", table_name="test_table")
```

DynFr is a DynamicFrame, so to work with Spark code in Glue we need to convert it into a regular Spark DataFrame, as below.

```
df1 = DynFr.toDF()
```
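On running plain Spark code in Glue: the Spark session from `glueContext.spark_session` accepts the usual `spark.read` calls, but the job runs in AWS and cannot see a path on a local machine, so the file has to live somewhere the job can reach, such as S3. The script in the question also uses `Job` and `args` without importing or defining them. Below is a rough sketch of how the pieces might fit together. The helper `read_csv_with_spark` and the S3 path are hypothetical, and the `awsglue` and `pyspark` imports are assumed to be available only inside the Glue job environment.

```python
def read_csv_with_spark(spark, path, header=True, delimiter=","):
    """Read a delimited file with the plain Spark reader (hypothetical helper).

    A Glue job cannot see files on a local drive, so this sketch
    deliberately rejects anything that is not an S3 URI.
    """
    if not path.startswith("s3://"):
        raise ValueError(f"expected an S3 URI, got: {path!r}")
    return (
        spark.read
        .option("header", str(header).lower())
        .option("delimiter", delimiter)
        .csv(path)
    )


def main(argv):
    # Imported here because these modules are assumed to exist only
    # inside the Glue job environment.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    spark = glue_context.spark_session
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Placeholder path: replace with the real bucket and key.
    df = read_csv_with_spark(spark, "s3://MyBucket/path/TEST.txt")
    df.show(2)
    job.commit()


# Entry point in a Glue job script: import sys; main(sys.argv)
```

The result here is an ordinary Spark DataFrame, so no `toDF()` conversion is needed on this route.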
