Reputation: 111
Hi, I am very new to AWS.
I am trying to retrieve a 5 GB CSV file that I have stored in an S3 bucket, do ETL on it, and load it into a DynamoDB table using AWS Glue. My Glue job is a pure Python shell job, not Spark.
My problem is that when I try to retrieve the file, I get a FileNotFoundError. Here is my code:
import boto3
import logging
import csv
import s3fs
from boto3 import client
from boto3.dynamodb.conditions import Key
from botocore.exceptions import ClientError
csv_file_path = 's3://my_s3_bucket/mycsv_file.csv'
A few lines down, within my class ...:
with open(self.csv_file_path, "r") as input:
    csv_reader = csv.reader(input, delimiter='^', quoting=csv.QUOTE_NONE)
    for row in csv_reader:
The with open(...) call is where I get the file-not-found error, even though the file is there. I really do not want to use pandas; we've had problems working with pandas within Glue. Since this is a 5 GB file I can't store it in memory, which is why I'm trying to open it and read it row by row.
I would really appreciate help on this.
Also, I have the correct IAM Glue permissions set up and everything.
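For what it's worth, Python's built-in open() only resolves local filesystem paths, so it treats the s3:// URI as a nonexistent local file, which is why it raises FileNotFoundError even though the object exists in S3. As a sanity check (a minimal sketch, assuming the bucket and key names from the code above), boto3's head_object can confirm the object is reachable through the API:

import boto3

# Sketch: verify the S3 object exists before trying to read it.
# 'my_s3_bucket' and 'mycsv_file.csv' are the names assumed from the question.
s3 = boto3.client('s3')
s3.head_object(Bucket='my_s3_bucket', Key='mycsv_file.csv')  # raises ClientError (404) if the key is missing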
Upvotes: 1
Views: 3101
Reputation: 111
I figured it out: you have to use the S3 client from boto3.
s3 = boto3.client('s3')
file = s3.get_object(Bucket='bucket_name', Key='file_name')
lines = file['Body'].read().decode('utf-8').splitlines(True)
csv_reader = csv.reader(lines, delimiter='^', quoting=csv.QUOTE_NONE)  # '^' to match the question's delimiter
and then just create a for loop for the csv reader
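One caveat: read() pulls the entire object into memory at once, which can be a problem for a 5 GB file on a Glue Python shell worker. A streaming variant (a sketch using botocore's StreamingBody.iter_lines, with the same assumed bucket and key names) processes the file line by line instead:

import csv
import codecs
import boto3

s3 = boto3.client('s3')
file = s3.get_object(Bucket='bucket_name', Key='file_name')
# iter_lines() streams the body in chunks instead of loading all 5 GB at once;
# codecs.iterdecode turns the byte lines into str for csv.reader.
line_stream = codecs.iterdecode(file['Body'].iter_lines(), 'utf-8')
csv_reader = csv.reader(line_stream, delimiter='^', quoting=csv.QUOTE_NONE)
for row in csv_reader:
    ...  # process each row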
Upvotes: 4