greenway

Reputation: 111

How do I read row by row of a CSV file from S3 in AWS Glue Job

Hi, I am very new to AWS.

I am trying to retrieve a 5 GB CSV file that I have stored in an S3 bucket, do ETL on it, and load it into a DynamoDB table using AWS Glue. My Glue job is a pure Python shell job, not Spark.

My problem is that when I try to retrieve the file, I get a file-not-found exception, even though the file is there. Here is my code:

import boto3
import logging
import csv
import s3fs

from boto3 import client
from boto3.dynamodb.conditions import Key
from botocore.exceptions import ClientError

csv_file_path = 's3://my_s3_bucket/mycsv_file.csv'

A few lines down, within my class:

with open(self.csv_file_path, "r") as input:
    csv_reader = csv.reader(input, delimiter='^', quoting=csv.QUOTE_NONE)

    for row in csv_reader:

The with open call is where I get the file-not-found error. I really do not want to use pandas; we've had problems working with pandas within Glue. Since this is a 5 GB file, I can't hold it all in memory, which is why I'm trying to open it and read it row by row.

I would really appreciate the help on this.

Also, I have the correct IAM permissions set up for the Glue job.

Upvotes: 1

Views: 3101

Answers (1)

greenway

Reputation: 111

I figured it out: Python's built-in open() can't read an s3:// URL, so you have to fetch the object through the boto3 S3 client instead.

s3 = boto3.client('s3')

# 'Body' is a streaming handle to the object's contents
file = s3.get_object(Bucket='bucket_name', Key='file_name')

# read the whole body, decode it, and split it into lines (keeping line endings)
lines = file['Body'].read().decode('utf-8').splitlines(True)

# delimiter='^' to match the file from the question
csv_reader = csv.reader(lines, delimiter='^', quoting=csv.QUOTE_NONE)

Then just loop over the csv reader with a for loop.
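One caveat: file['Body'].read() pulls the entire object into memory before splitting it into lines, which works against the goal of processing a 5 GB file row by row. Below is a minimal sketch of a streaming variant, reusing the bucket and key names from the question as placeholders. It wraps the response's StreamingBody in a codecs reader so csv.reader can consume it line by line:

import codecs
import csv

import boto3

s3 = boto3.client('s3')

# bucket and key taken from the question's s3://my_s3_bucket/mycsv_file.csv;
# substitute your own
response = s3.get_object(Bucket='my_s3_bucket', Key='mycsv_file.csv')

# codecs.getreader decodes the StreamingBody incrementally, so csv.reader
# can iterate line by line without loading the whole 5 GB object at once
stream = codecs.getreader('utf-8')(response['Body'])

csv_reader = csv.reader(stream, delimiter='^', quoting=csv.QUOTE_NONE)
for row in csv_reader:
    # do the per-row ETL / DynamoDB write here
    print(row)

Alternatively, since the question already imports s3fs, s3fs.S3FileSystem().open('my_s3_bucket/mycsv_file.csv', 'r') should return a file-like object that can be passed straight to csv.reader, though the boto3 approach above avoids depending on that extra library.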

Upvotes: 4
