Tibberzz

Reputation: 551

Read Headers from Data Source in an AWS Glue Job

I have an AWS Glue job that reads from a data source like so:

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "dev-data", table_name = "contacts", transformation_ctx = "datasource0")

But when I call .toDF() on the dynamic frame, the headers come out as 'col0', 'col1', 'col2', etc., and my actual headers are in the first row of the dataframe.

Note: I can't set them manually, since the columns in the data source are variable, and iterating over the columns in a loop to set them fails because you'd have to reassign the same dataframe variable multiple times, which Glue can't handle.

How might I capture the headers while reading from the data source?
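To make the goal concrete: outside of Glue, "capturing the headers" amounts to promoting the first row to column names. A minimal plain-Python sketch of that step (sample data and the `promote_header` helper are made up for illustration; in Spark the equivalent single-assignment rename would be something like `df = df.toDF(*header)`):

```python
def promote_header(rows):
    """Treat the first row as column names; return the names and the
    remaining rows as dicts keyed by those names."""
    header, *data = rows
    return header, [dict(zip(header, row)) for row in data]

rows = [["id", "name"], ["1", "alice"], ["2", "bob"]]
names, records = promote_header(rows)
print(names)       # ['id', 'name']
print(records[0])  # {'id': '1', 'name': 'alice'}
```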

Upvotes: 8

Views: 9014

Answers (4)

rachi_gene

Reputation: 1

I made a few changes to read with the header, as follows:

dyF = glueContext.create_dynamic_frame.from_options(
    's3',
    {'paths': ['s3://bucketname/key_to_csv_file']},
    format= 'csv',
    format_options= {'withHeader': True})

Upvotes: 0

TheGreenSpleen25

Reputation: 21

I know this post is old, but I just ran into a similar issue and spent way too long figuring out what the problem was. Wanted to share my solution in case it's helpful to others!

I was using the GUI on AWS and forgot to actually add the correct classifier to the crawler before running it. This resulted in AWS Glue incorrectly detecting datatypes (they mostly came out as strings) and the column names were not detected (they came out as col1, col2, etc). You can create the classifier in "classifiers" under "crawlers". Then, when setting up the crawler, add your classifier to the "selected classifiers" section at the bottom.

Documentation: https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html
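The same classifier can also be created programmatically via boto3's `create_classifier`; a sketch under the assumption that you want a CSV classifier whose first row is the header (the client is passed in so the snippet can be exercised without AWS, and the classifier name is made up):

```python
def add_csv_classifier(glue_client, name, delimiter=","):
    # ContainsHeader="PRESENT" tells the crawler that the first row of
    # the CSV holds the column names rather than data.
    return glue_client.create_classifier(
        CsvClassifier={
            "Name": name,
            "Delimiter": delimiter,
            "ContainsHeader": "PRESENT",
        }
    )
```

You would then attach the classifier to the crawler (e.g. via the `Classifiers` list in `create_crawler`), just as the console's "selected classifiers" section does.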

Upvotes: 2

Dheeraj Inampudi

Reputation: 1457

You can try the withHeader parameter, e.g.:

dyF = glueContext.create_dynamic_frame.from_options(
    's3',
    {'paths': ['s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv']},
    'csv',
    {'withHeader': True})

The documentation for this can be found here

Upvotes: 2

Tibberzz

Reputation: 551

It turns out it's a bug in the Glue crawler: it doesn't support headers yet. The workaround I used was to go through the motions of crawling the data anyway. When the crawler completes, a Lambda triggered by the crawler-completion CloudWatch event kicks off a Glue job that reads directly from S3. Once Glue is fixed to support reading in the headers, I can switch out how I read them.
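A sketch of what such a Lambda handler might look like, assuming a rule matching the "Glue Crawler State Change" CloudWatch/EventBridge event (the job name is hypothetical, and the client is injectable so the snippet can be exercised without AWS):

```python
def handler(event, context, glue_client=None):
    """Start the downstream Glue job when the crawler finishes successfully."""
    detail = event.get("detail", {})
    if detail.get("state") != "Succeeded":
        return None  # ignore failed/cancelled crawler runs
    if glue_client is None:
        import boto3  # real client only when none is injected
        glue_client = boto3.client("glue")
    run = glue_client.start_job_run(
        JobName="read-contacts-from-s3",  # hypothetical job name
        Arguments={"--crawler": detail.get("crawlerName", "")},
    )
    return run["JobRunId"]
```

The injectable client also keeps the handler easy to unit-test with a stub in place of the real Glue API.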

Upvotes: 1
