KristiLuna

Reputation: 1903

Airflow S3Hook - read files in S3 with pandas read_csv

I'm trying to read some files with pandas, using the S3Hook to get the keys. I can get the keys, but I'm not sure how to get pandas to find the files. When I run the code below I get:

No such file or directory:

Here is my code:

import logging

import pandas as pd
from airflow.models import Variable
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def transform_pages(company, **context):
    ds = context.get("execution_date").strftime('%Y-%m-%d')

    s3 = S3Hook('aws_default')
    prefix = f'S/{company}/pages/date={ds}/'
    keys = s3.list_keys(bucket_name=Variable.get('s3_bucket'),
                        prefix=prefix,
                        delimiter="/")

    logging.info(f'keys from function: {keys}')

    # transforming pages and loading data back to S3
    for file in keys:
        df = pd.read_csv(file, sep='\t', skiprows=1, header=None)

Upvotes: 6

Views: 7494

Answers (1)

fsl

Reputation: 3280

The format you are looking for is the following:

filepath = f"s3://{bucket_name}/{key}"

So in your specific case, something like:

bucket = Variable.get('s3_bucket')
for file in keys:
    filepath = f"s3://{bucket}/{file}"
    df = pd.read_csv(filepath, sep='\t', skiprows=1, header=None)

Just make sure you have s3fs installed (pip install s3fs); pandas relies on it to read s3:// paths.
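If you'd rather not add the s3fs dependency, here is a minimal sketch of the same loop using the hook's read_key method, which returns the object body as a string (assuming the same s3 hook and s3_bucket Variable from your question):

from io import StringIO

bucket = Variable.get('s3_bucket')
for file in keys:
    # read_key fetches the object contents as a string, so no s3fs is needed
    body = s3.read_key(key=file, bucket_name=bucket)
    df = pd.read_csv(StringIO(body), sep='\t', skiprows=1, header=None)

This keeps everything going through the Airflow connection, at the cost of loading each file fully into memory before parsing.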

Upvotes: 3
