Reputation: 1903
I'm trying to read some files with pandas, using the S3Hook to get the keys. I'm able to get the keys; however, I'm not sure how to get pandas to find the files. When I run the code below I get:
No such file or directory:
Here is my code:
def transform_pages(company, **context):
    ds = context.get("execution_date").strftime('%Y-%m-%d')
    s3 = S3Hook('aws_default')
    s3_conn = s3.get_conn()
    keys = s3.list_keys(bucket_name=Variable.get('s3_bucket'),
                        prefix=f'S/{company}/pages/date={ds}/',
                        delimiter="/")
    prefix = f'S/{company}/pages/date={ds}/'
    logging.info(f'keys from function: {keys}')
    """ transforming pages and loading data back to S3 """
    for file in keys:
        df = pd.read_csv(file, sep='\t', skiprows=1, header=None)
Upvotes: 6
Views: 7494
Reputation: 3280
The format you are looking for is the following:
filepath = f"s3://{bucket_name}/{key}"
So in your specific case, something like:
for file in keys:
    filepath = f"s3://{Variable.get('s3_bucket')}/{file}"
    df = pd.read_csv(filepath, sep='\t', skiprows=1, header=None)
Just make sure you have s3fs installed though (pip install s3fs).
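To make the path format concrete, here is a minimal sketch of how the bucket name and a key from list_keys combine into the s3:// URL that pandas (via s3fs) can open. The bucket name and key below are hypothetical placeholders, not values from the question:

    # Hypothetical bucket name; in the question this comes from
    # Variable.get('s3_bucket') in Airflow.
    bucket_name = "my-data-bucket"

    # A key shaped like what S3Hook.list_keys returns for the
    # question's prefix f'S/{company}/pages/date={ds}/'.
    keys = ["S/acme/pages/date=2023-01-01/part-000.tsv"]

    # Prepend the scheme and bucket so pandas can resolve the object.
    filepaths = [f"s3://{bucket_name}/{key}" for key in keys]
    print(filepaths[0])  # s3://my-data-bucket/S/acme/pages/date=2023-01-01/part-000.tsv

Each resulting filepath can then be passed to pd.read_csv as in the answer above.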
Upvotes: 3