RK.
RK.

Reputation: 617

How to load a csv/txt file into AWS Glue job

I have below 2 clarifications on AWS Glue, could you please clarify. Because I need to use glue as part of my project.

  1. I would like to load a csv/txt file into a Glue job to process it. (Like we do in Spark with dataframes). Is this possible in Glue? Or do we have to use only Crawlers to crawl the data into Glue tables and make use of them like below for further processing?

    empdf = glueContext.create_dynamic_frame.from_catalog(
        database="emp",
        table_name="emp_json")
    
  2. Below I used Spark code to load a file into Glue, but I'm getting lengthy error logs. Can we directly run Spark or PySpark code as it is without any changes in Glue?

    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    dfnew = spark.read.option("header","true").option("delimiter", ",").csv("C:\inputs\TEST.txt")
    dfnew.show(2)
    

Upvotes: 5

Views: 28168

Answers (2)

Yuriy Bondaruk
Yuriy Bondaruk

Reputation: 4750

It's possible to load data directly from s3 using Glue:

sourceDyf = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    format="csv",
    connection_options={
        "paths": ["s3://bucket/folder"]
    },
    format_options={
        "withHeader": True,
        "separator": ","
    })

You can also do that just with spark (as you already tried):

sourceDf = spark.read
    .option("header","true")
    .option("delimiter", ",")
    .csv("C:\inputs\TEST.txt") 

However, in this case Glue doesn't guarantee that they provide appropriate Spark readers. So if your error is related to missing data source for CSV then you should add spark-csv lib to the Glue job by providing s3 path to its locations via --extra-jars parameter.

Upvotes: 11

RK.
RK.

Reputation: 617

Below 2 cases i tested working fine:

To load a file from S3 into Glue.

dfnew = glueContext.create_dynamic_frame_from_options("s3", {'paths': ["s3://MyBucket/path/"] }, format="csv" )

dfnew.show(2)

To load data from Glue db and tables which are generated already through Glue Crawlers.

DynFr = glueContext.create_dynamic_frame.from_catalog(database="test_db", table_name="test_table")

DynFr is a DynamicFrame, so if we want to work with Spark code in Glue, then we need to convert it into a normal data frame like below.

df1 = DynFr.toDF()

Upvotes: 5

Related Questions