Reputation: 513
I'm trying to read a CSV file from a private S3 bucket to a pandas dataframe:
df = pandas.read_csv('s3://mybucket/file.csv')
I can read a file from a public bucket, but reading a file from a private bucket results in an HTTP 403: Forbidden error.
I have configured the AWS credentials using aws configure.
I can download a file from a private bucket using boto3, which uses my AWS credentials. It seems that I need to configure pandas to use AWS credentials as well, but I don't know how.
Upvotes: 50
Views: 119244
Reputation: 28946
UPDATE: 2023
You will need s3fs to be installed, as that's what pandas uses now.
A 403 like this is most probably a permissions issue, either with your IAM configuration or with the credentials that boto3/s3fs picks up. It also depends on whether the code is running locally, on EC2, etc.
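As a minimal sketch (assuming pandas 1.2+ with s3fs installed, and placeholder credential values), credentials can also be passed explicitly through storage_options, which pandas forwards to s3fs:
import pandas as pd
# 'key'/'secret' below are placeholders for your own credentials;
# storage_options is passed through to s3fs.S3FileSystem
df = pd.read_csv(
    's3://mybucket/file.csv',
    storage_options={
        'key': 'YOUR_AWS_ACCESS_KEY_ID',
        'secret': 'YOUR_AWS_SECRET_ACCESS_KEY',
    },
)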
OLD ANSWER
Pandas uses boto (not boto3) inside read_csv. You might be able to just install boto and have it work correctly.
There are some issues with boto on Python 3.4.4 / 3.5.1. If you're on those versions, and until those are fixed, you can use boto3 as follows:
import boto3
import pandas as pd
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(obj['Body'])
The obj['Body'] returned by get_object has a .read method (returning bytes), which is enough for pandas.
Upvotes: 52
Reputation: 5599
You can use the AWS SDK for Pandas (awswrangler), a library that extends Pandas to work smoothly with AWS data stores, such as S3.
import awswrangler as wr
df = wr.s3.read_csv("s3://bucket/file.csv")
It can be installed via pip install awswrangler.
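If the default credential chain doesn't pick up the right credentials, a sketch (with a hypothetical profile name) is to hand awswrangler an explicit boto3 session:
import boto3
import awswrangler as wr
# 'my-profile' is a placeholder profile from ~/.aws/credentials
session = boto3.Session(profile_name='my-profile')
df = wr.s3.read_csv('s3://bucket/file.csv', boto3_session=session)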
Upvotes: -1
Reputation: 423
Update for pandas 0.22 and up:
If you have already installed s3fs (pip install s3fs), then you can read the file directly from the s3 path, without any additional imports:
data = pd.read_csv('s3://bucket....csv')
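On newer pandas (1.2+), you can also select a specific AWS profile without touching environment variables; a sketch, assuming a profile named 'my-profile' exists in ~/.aws/credentials (recent s3fs releases accept profile, older ones called it profile_name):
import pandas as pd
# storage_options is forwarded to s3fs.S3FileSystem
data = pd.read_csv('s3://bucket....csv', storage_options={'profile': 'my-profile'})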
Upvotes: 30
Reputation: 103
To add on to the other solutions here with more recent updates: pandas, fsspec, and s3fs have all been updated such that you can read directly from custom endpoints using pandas with no other imports. You must make sure you have both fsspec and s3fs installed, as they are optional dependencies for pandas.
Then you can use
import pandas as pd
pd.read_csv(
    's3://mybucket/file.csv',
    storage_options={
        'client_kwargs': {
            'endpoint_url': <MY_S3_ENDPOINT_URL>
        }
    }
)
It's clunky, but it's required, as the maintainers of boto3 have for some reason refused for years to allow custom endpoint configuration outside of client construction (i.e. in config files or environment variables). But if you don't need to read/write through pandas, awswrangler is perhaps better, as others have mentioned.
Upvotes: 1
Reputation: 5292
The previous answers are a good basic start, but I wanted to achieve the more advanced objectives stated below. Overall I feel awswrangler is the way to go.
Here are a couple of things that worked:
import boto3
import pandas as pd
import awswrangler as wr
boto3_creds = dict(region_name="us-east-1", aws_access_key_id='', aws_secret_access_key='')
boto3.setup_default_session(**boto3_creds)
s3 = boto3.client('s3')
# read first 5 lines from file path
obj = s3.get_object(Bucket='bucket', Key='path.csv.gz')
df = pd.read_csv(obj['Body'], nrows=5, compression='gzip')
# read first 5 lines from directory
dft_xp = pd.concat(list(wr.s3.read_csv(wr.s3.list_objects('s3://bucket/path/')[0], chunksize=5, nrows=5, compression='gzip')))
# read all files into pandas
df_xp = wr.s3.read_csv(wr.s3.list_objects('s3://bucket/path/'), compression='gzip')
I did not use s3fs; I wasn't sure whether it uses boto3.
For distributed compute with dask, this worked, but it uses s3fs as far as I know, and apparently gzip can't be parallelized.
import dask.dataframe as dd
dd.read_csv('s3://bucket/path/*', storage_options={'key':'', 'secret':''}, compression='gzip').head(5)
dd.read_csv('s3://bucket/path/*', storage_options={'key':'', 'secret':''}, compression='gzip')
# Warning gzip compression does not support breaking apart files Please ensure that each individual file can fit in memory
Upvotes: 0
Reputation: 3125
Updated for Pandas 0.20.1
Pandas now uses s3fs to handle S3 connections. From the release notes:
pandas now uses s3fs for handling S3 connections. This shouldn’t break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas.
import os
import pandas as pd
from s3fs.core import S3FileSystem
# aws keys stored in ini file in same path
# refer to boto3 docs for config settings
os.environ['AWS_CONFIG_FILE'] = 'aws_config.ini'
s3 = S3FileSystem(anon=False)
key = 'path/to/your-csv.csv'
bucket = 'your-bucket-name'
df = pd.read_csv(s3.open('{}/{}'.format(bucket, key), mode='rb'))
# or with f-strings
df = pd.read_csv(s3.open(f'{bucket}/{key}', mode='rb'))
Upvotes: 38
Reputation: 69
import s3fs
import pandas as pd
s3 = s3fs.S3FileSystem(profile='<profile_name>')
pd.read_csv(s3.open(<s3_path>))
You can also pass credentials manually.
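For example (a sketch with placeholder credentials and path), key and secret can be passed straight to S3FileSystem:
import s3fs
import pandas as pd
# placeholder credentials and path
s3 = s3fs.S3FileSystem(key='YOUR_AWS_ACCESS_KEY_ID', secret='YOUR_AWS_SECRET_ACCESS_KEY')
df = pd.read_csv(s3.open('mybucket/path/to/file.csv', mode='rb'))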
Upvotes: 3
Reputation: 145
In addition to the other awesome answers, if a custom endpoint is required, it is possible to keep the pd.read_csv('s3://...') syntax by monkey patching the s3fs init method.
import s3fs
s3fsinit = s3fs.S3FileSystem.__init__
def s3fsinit_patched(self, *k, **kw):
s3fsinit(self, *k, client_kwargs={'endpoint_url': 'https://yourcustomendpoint'}, **kw)
s3fs.S3FileSystem.__init__ = s3fsinit_patched
Or, a more elegant way:
import os
import s3fs
class S3FileSystemPatched(s3fs.S3FileSystem):
def __init__(self, *k, **kw):
super(S3FileSystemPatched, self).__init__( *k,
key = os.environ['aws_access_key_id'],
secret = os.environ['aws_secret_access_key'],
client_kwargs={'endpoint_url': 'https://yourcustomendpoint'},
**kw)
print('S3FileSystem is patched')
s3fs.S3FileSystem = S3FileSystemPatched
Also see: s3fs custom endpoint url
Upvotes: 3
Reputation: 121
import pandas as pd
import boto3
from io import StringIO
# Read CSV (endpoint_url, aws_access_key_id, aws_secret_access_key, bucket and key are assumed to be defined)
s3 = boto3.client('s3', endpoint_url=endpoint_url,
                  aws_access_key_id=aws_access_key_id,
                  aws_secret_access_key=aws_secret_access_key)
read_file = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(read_file['Body'], sep=',')
# Write CSV
csv_buffer = StringIO()
df.to_csv(csv_buffer)
s3.put_object(Bucket=bucket, Key=key, Body=csv_buffer.getvalue())
Upvotes: 12
Reputation: 1146
Note that if your bucket is private AND on an aws-like provider, you will run into errors, as s3fs does not load the profile config file at ~/.aws/config the way awscli does.
One solution is to set the following environment variables:
export AWS_S3_ENDPOINT="myEndpoint"
export AWS_DEFAULT_REGION="MyRegion"
Upvotes: 0
Reputation: 1982
Based on this answer, I found smart_open to be much simpler to use:
import pandas as pd
from smart_open import smart_open
initial_df = pd.read_csv(smart_open('s3://bucket/file.csv'))
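In newer smart_open releases (5.x and up, as far as I know) the entry point is open rather than smart_open, and you can pass a preconfigured boto3 client for private buckets; a sketch with a hypothetical profile name:
import boto3
import pandas as pd
from smart_open import open as sopen
# 'my-profile' is a placeholder profile from ~/.aws/credentials
client = boto3.Session(profile_name='my-profile').client('s3')
df = pd.read_csv(sopen('s3://bucket/file.csv', transport_params={'client': client}))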
Upvotes: 16
Reputation: 3701
Update for pandas 0.20.3 without using s3fs:
import boto3
import pandas as pd
import sys
if sys.version_info[0] < 3:
from StringIO import StringIO # Python 2.x
else:
from io import StringIO # Python 3.x
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
body = obj['Body']
csv_string = body.read().decode('utf-8')
df = pd.read_csv(StringIO(csv_string))
Upvotes: 2