KristiLuna

Reputation: 1903

Airflow S3Hook - read files in S3 with pandas read_csv

I'm trying to read some files with pandas, using the S3Hook to get the keys. I can get the keys, but I'm not sure how to get pandas to find the files. When I run the code below I get:

No such file or directory:

Here is my code:

import logging

import pandas as pd
from airflow.models import Variable
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def transform_pages(company, **context):
    ds = context.get("execution_date").strftime('%Y-%m-%d')

    s3 = S3Hook('aws_default')
    prefix = f'S/{company}/pages/date={ds}/'
    keys = s3.list_keys(bucket_name=Variable.get('s3_bucket'),
                        prefix=prefix,
                        delimiter="/")

    logging.info(f'keys from function: {keys}')

    # transforming pages and loading data back to S3
    for file in keys:
        df = pd.read_csv(file, sep='\t', skiprows=1, header=None)

Upvotes: 6

Views: 7494

Answers (1)

fsl

Reputation: 3280

The format you are looking for is the following:

filepath = f"s3://{bucket_name}/{key}"

So in your specific case, something like:

bucket = Variable.get('s3_bucket')
for file in keys:
    filepath = f"s3://{bucket}/{file}"
    df = pd.read_csv(filepath, sep='\t', skiprows=1, header=None)

Just make sure you have s3fs installed (pip install s3fs); pandas relies on it to read s3:// paths.
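If you'd rather not add the s3fs dependency, here is a minimal sketch of the same loop using the hook's read_key method, which returns the object body as a string (assuming the same s3 hook and s3_bucket Variable from your question):

from io import StringIO

bucket = Variable.get('s3_bucket')
for file in keys:
    # read_key fetches the object contents as a string, so no s3fs is needed
    body = s3.read_key(key=file, bucket_name=bucket)
    df = pd.read_csv(StringIO(body), sep='\t', skiprows=1, header=None)

This keeps everything going through the Airflow connection, at the cost of loading each file fully into memory before parsing.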

Upvotes: 3
