IgorK

Reputation: 513

Reading a file from a private S3 bucket to a pandas dataframe

I'm trying to read a CSV file from a private S3 bucket to a pandas dataframe:

df = pandas.read_csv('s3://mybucket/file.csv')

I can read a file from a public bucket, but reading a file from a private bucket results in an HTTP 403 Forbidden error.

I have configured the AWS credentials using aws configure.

I can download a file from a private bucket using boto3, which uses the AWS credentials. It seems that I need to configure pandas to use AWS credentials, but I don't know how.
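For reference, a minimal sketch of the kind of boto3 download that does work (bucket, key and target path are placeholders):

import boto3

# boto3 picks up the credentials written by `aws configure`
s3 = boto3.client('s3')
# this succeeds, so the credentials themselves are valid
s3.download_file('mybucket', 'file.csv', '/tmp/file.csv')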

Upvotes: 50

Views: 119244

Answers (12)

TomAugspurger

Reputation: 28946

UPDATE: 2023

You will need s3fs installed, as that is what pandas uses for S3 now. Beyond that, a 403 is most probably a permissions issue, either with your IAM configuration or with the way your credentials are being resolved. It also depends on where the code is running (locally, on an EC2 instance, etc.).
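If the credentials are not being picked up automatically, one quick check is to point pandas at them explicitly via storage_options, which is forwarded to s3fs (a sketch; the profile name is whatever aws configure created, and storage_options needs pandas 1.2 or later):

import pandas as pd

# storage_options is passed through to s3fs; 'profile' selects a profile
# from ~/.aws/credentials
df = pd.read_csv(
    "s3://mybucket/file.csv",
    storage_options={"profile": "default"},
)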

OLD ANSWER

Pandas uses boto (not boto3) inside read_csv. You might be able to install boto and have it work correctly.

There are some problems with boto on Python 3.4.4 / 3.5.1. If you're on those versions, and until those are fixed, you can use boto3 instead:

import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(obj['Body'])

The obj['Body'] value is a streaming, file-like object with a .read method, which is enough for pandas.
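As a side note, because pandas only needs a file-like object, you can also buffer the bytes yourself before parsing (continuing from the snippet above):

import io

# obj['Body'] is a streaming, file-like object; .read() returns raw bytes
raw_bytes = obj['Body'].read()
df = pd.read_csv(io.BytesIO(raw_bytes))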

Upvotes: 52

Theofilos Papapanagiotou

Reputation: 5599

You can use AWS SDK for Pandas, a library that extends Pandas to work smoothly with AWS data stores, such as S3.

import awswrangler as wr
df = wr.s3.read_csv("s3://bucket/file.csv")

It can be installed via pip install awswrangler.
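For a private bucket with a non-default profile, awswrangler should also accept an explicit boto3 session (the profile name below is a placeholder):

import boto3
import awswrangler as wr

# pass an explicit session so awswrangler uses the right credentials
session = boto3.Session(profile_name="my-profile")  # placeholder profile name
df = wr.s3.read_csv("s3://bucket/file.csv", boto3_session=session)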

Upvotes: -1

Isaac

Reputation: 423

Update for pandas 0.22 and up:

If you have already installed s3fs (pip install s3fs) then you can read the file directly from s3 path, without any imports:

data = pd.read_csv('s3://bucket....csv')

stable docs
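For a private bucket, s3fs goes through the same botocore credential chain as boto3, so selecting a named profile via the environment is usually enough (a sketch, assuming a profile created with aws configure):

import os

import pandas as pd

# botocore (and therefore s3fs) honours AWS_PROFILE when resolving credentials
os.environ["AWS_PROFILE"] = "default"
data = pd.read_csv("s3://bucket/file.csv")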

Upvotes: 30

Bart Spoon

Reputation: 103

To add on to the other solutions here with more recent updates: pandas, fsspec, and s3fs have all been updated so that you can read directly from custom endpoints using pandas, with no other imports. You must make sure you have both fsspec and s3fs installed, as they are optional dependencies for pandas.

Then you can use

import pandas as pd

pd.read_csv(
    's3://mybucket/file.csv',
    storage_options={
        'client_kwargs': {
            'endpoint_url': '<MY_S3_ENDPOINT_URL>'  # your custom endpoint
        }
    }
)
It's clunky, but it's required because the boto3 maintainers have, for years, declined to support custom endpoint configuration outside of client construction (i.e. in config files or environment variables). But if you don't need to read/write through pandas, awswrangler is perhaps better, as others have mentioned.
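One usage pattern is to keep the endpoint in an environment variable of your own and forward it at call time (MY_S3_ENDPOINT_URL below is an arbitrary name chosen for this sketch, not something boto3 or pandas reads on its own):

import os

import pandas as pd

# read the endpoint from our own environment variable and pass it through
endpoint = os.environ["MY_S3_ENDPOINT_URL"]
df = pd.read_csv(
    "s3://mybucket/file.csv",
    storage_options={"client_kwargs": {"endpoint_url": endpoint}},
)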

Upvotes: 1

citynorman

Reputation: 5292

The previous answers are a good basic start, but I wanted to achieve the more advanced objectives listed below. Overall, I feel awswrangler is the way to go.

  1. read .gzip
  2. read only the first 5 lines without downloading the full file
  3. explicitly pass credentials (make sure you don't commit them to code!!)
  4. use full s3 paths

Here are a couple of things that worked


import boto3
import pandas as pd
import awswrangler as wr

boto3_creds = dict(region_name="us-east-1", aws_access_key_id='', aws_secret_access_key='')
boto3.setup_default_session(**boto3_creds)

s3 = boto3.client('s3')

# read first 5 lines from file path
obj = s3.get_object(Bucket='bucket', Key='path.csv.gz')
df = pd.read_csv(obj['Body'], nrows=5, compression='gzip')

# read first 5 lines from directory
dft_xp = pd.concat(list(wr.s3.read_csv(wr.s3.list_objects('s3://bucket/path/')[0], chunksize=5, nrows=5, compression='gzip')))

# read all files into pandas
df_xp = wr.s3.read_csv(wr.s3.list_objects('s3://bucket/path/'), compression='gzip')

I did not use s3fs; I wasn't sure whether it uses boto3.

For distributed compute with dask this worked, but it uses s3fs as far as I know, and apparently gzip can't be parallelized.


import dask.dataframe as dd

dd.read_csv('s3://bucket/path/*', storage_options={'key':'', 'secret':''}, compression='gzip').head(5)

dd.read_csv('s3://bucket/path/*', storage_options={'key':'', 'secret':''}, compression='gzip') 
# Warning: gzip compression does not support breaking apart files. Ensure that each individual file can fit in memory.
 

Upvotes: 0

spitfiredd

Reputation: 3125

Updated for Pandas 0.20.1

Pandas now uses s3fs to handle S3 connections (link). From the release notes:

pandas now uses s3fs for handling S3 connections. This shouldn’t break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas.

import os

import pandas as pd
from s3fs.core import S3FileSystem

# aws keys stored in ini file in same path
# refer to boto3 docs for config settings
os.environ['AWS_CONFIG_FILE'] = 'aws_config.ini'

s3 = S3FileSystem(anon=False)
key = 'path/to/your-csv.csv'
bucket = 'your-bucket-name'

df = pd.read_csv(s3.open('{}/{}'.format(bucket, key), mode='rb'))
# or with f-strings
df = pd.read_csv(s3.open(f'{bucket}/{key}', mode='rb'))

Upvotes: 38

Ze Tang

Reputation: 69

import s3fs
import pandas as pd
s3 = s3fs.S3FileSystem(profile='<profile_name>')
pd.read_csv(s3.open('<s3_path>'))

You can also pass credentials manually.
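For example, passing the key pair directly (placeholders; don't commit real keys to code):

import s3fs
import pandas as pd

# explicit credentials instead of a named profile
s3 = s3fs.S3FileSystem(
    key="<aws_access_key_id>",
    secret="<aws_secret_access_key>",
)
df = pd.read_csv(s3.open("s3://bucket/file.csv"))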

Upvotes: 3

fmguler

Reputation: 145

In addition to other awesome answers, if a custom endpoint is required, it is possible to use pd.read_csv('s3://...') syntax by monkey patching the s3fs init method.

import s3fs
s3fsinit = s3fs.S3FileSystem.__init__
def s3fsinit_patched(self, *k, **kw):
    s3fsinit(self, *k, client_kwargs={'endpoint_url': 'https://yourcustomendpoint'}, **kw)
s3fs.S3FileSystem.__init__ = s3fsinit_patched

Or, a more elegant way:

import os
import s3fs
class S3FileSystemPatched(s3fs.S3FileSystem):
    def __init__(self, *k, **kw):
        super(S3FileSystemPatched, self).__init__( *k,
                                                  key = os.environ['aws_access_key_id'],
                                                  secret = os.environ['aws_secret_access_key'],
                                                  client_kwargs={'endpoint_url': 'https://yourcustomendpoint'},
                                                  **kw)
        print('S3FileSystem is patched')
s3fs.S3FileSystem = S3FileSystemPatched

Also see: s3fs custom endpoint url

Upvotes: 3

Saeed Rahman

Reputation: 121

import pandas as pd
import boto3
from io import StringIO

# Read CSV (endpoint_url and the credentials below are placeholders to fill in)
s3 = boto3.client(
    's3',
    endpoint_url=endpoint_url,
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
)
read_file = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(read_file['Body'], sep=',')

# Write CSV
csv_buffer = StringIO()
df.to_csv(csv_buffer)
s3.put_object(Bucket=bucket, Key=key, Body=csv_buffer.getvalue())

Upvotes: 12

MCMZL

Reputation: 1146

Note that if your bucket is private AND hosted on an AWS-like (S3-compatible) provider, you will run into errors, because s3fs does not load the profile config file at ~/.aws/config the way awscli does.

One solution is to define the following environment variables:

export AWS_S3_ENDPOINT="myEndpoint"
export AWS_DEFAULT_REGION="MyRegion"

Upvotes: 0

kepler

Reputation: 1982

Based on this answer, I found smart_open to be much simpler to use:

import pandas as pd
from smart_open import smart_open

initial_df = pd.read_csv(smart_open('s3://bucket/file.csv'))
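In newer smart_open releases the entry point is open rather than smart_open, and, if I read the docs right, explicit credentials go through transport_params (the profile name below is a placeholder):

import boto3
import pandas as pd
from smart_open import open as smart_open_file

# newer smart_open versions take an explicit S3 client via transport_params
session = boto3.Session(profile_name="my-profile")  # placeholder profile name
df = pd.read_csv(
    smart_open_file(
        "s3://bucket/file.csv",
        transport_params={"client": session.client("s3")},
    )
)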

Upvotes: 16

jpobst

Reputation: 3701

Update for pandas 0.20.3 without using s3fs:

import boto3
import pandas as pd
import sys

if sys.version_info[0] < 3: 
    from StringIO import StringIO # Python 2.x
else:
    from io import StringIO # Python 3.x

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
body = obj['Body']
csv_string = body.read().decode('utf-8')

df = pd.read_csv(StringIO(csv_string))

Upvotes: 2
