Reputation: 2091
I have a variable which contains an AWS S3 URL:
s3://bucket_name/folder1/folder2/file1.json
I want to get the bucket_name in one variable and the rest, i.e. /folder1/folder2/file1.json, in another variable. I tried regular expressions and could get the bucket_name like below, but I'm not sure if there is a better way.
import re

m = re.search('(?<=s3:\/\/)[^\/]+', 's3://bucket_name/folder1/folder2/file1.json')
print(m.group(0))
How do I get the rest, i.e. folder1/folder2/file1.json?
I checked whether boto3 has a feature to extract the bucket_name and key from the URL, but couldn't find one.
Upvotes: 104
Views: 159577
Reputation: 63
I got here and none of these things worked for me. AWS changed its specification for paths to S3 documents in 2020; see here: https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html
Some of AWS's own documents (e.g. cloud pattern templates) are in old or legacy forms. In our company, we have documents stored both before and after 2020, and the different formats break the suggestions already made here.
My (very simplistic) Python code to parse an HTTP path (virtual-hosted, path-style, or legacy) is as follows:
import re

#--- Returns a tuple (bucket, key) ... or (False, False)
def s3_parse_url(url):
    #--- Path style (old style): https://s3.region-code.amazonaws.com/bucket-name/key-name
    #    (group 1: bucket-name, group 2: key-name)
    regex_path = r"(?:https|s3):\/\/s3\.[a-z\-0-9]*\.amazonaws\.com\/([^\/]*)\/([\s\S]*)$"
    m = re.match(regex_path, url, re.IGNORECASE)
    if m:
        return m.group(1), m.group(2)
    #--- Virtual-host style: https://bucket-name.s3.region-code.amazonaws.com/key-name
    #    (group 1: bucket-name, group 2: key-name)
    #    .. note the legacy style uses a dash after s3 instead of a dot
    regex_virtual = r"(?:https|s3):\/\/([^\.]*)\.s3[\-\.][a-z\-0-9]*\.amazonaws\.com\/([\s\S]*)$"
    m = re.match(regex_virtual, url, re.IGNORECASE)
    if m:
        return m.group(1), m.group(2)
    #--- Legacy global endpoint: https://bucket-name.s3.amazonaws.com/key-name
    #    (group 1: bucket-name, group 2: key-name)
    regex_legacy = r"(?:https|s3):\/\/([^\.]*)\.s3\.amazonaws\.com\/([\s\S]*)$"
    m = re.match(regex_legacy, url, re.IGNORECASE)
    if m:
        return m.group(1), m.group(2)
    #--- Neither style matched, so fail
    return False, False
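A quick sanity check on both styles (the bucket and key names here are made up for illustration):
print(s3_parse_url("https://s3.us-east-1.amazonaws.com/my-bucket/folder/file.json"))
#> ('my-bucket', 'folder/file.json')
print(s3_parse_url("https://my-bucket.s3.us-east-1.amazonaws.com/folder/file.json"))
#> ('my-bucket', 'folder/file.json')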
Upvotes: 0
Reputation: 2468
We might want to reuse the code used by the AWS CLI. Unfortunately, it's not part of Boto3 itself.
def find_bucket_key(s3_path):
    """
    This is a helper function that given an s3 path such that the path is of
    the form: bucket/key
    It will return the bucket and the key represented by the s3 path
    """
    block_unsupported_resources(s3_path)
    match = _S3_ACCESSPOINT_TO_BUCKET_KEY_REGEX.match(s3_path)
    if match:
        return match.group('bucket'), match.group('key')
    match = _S3_OUTPOST_TO_BUCKET_KEY_REGEX.match(s3_path)
    if match:
        return match.group('bucket'), match.group('key')
    s3_components = s3_path.split('/', 1)
    bucket = s3_components[0]
    s3_key = ''
    if len(s3_components) > 1:
        s3_key = s3_components[1]
    return bucket, s3_key
(I'm reproducing the code here mainly to shine a light on a shortcoming of the other proposed solutions: they are missing access point and Outposts support.)
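If the AWS CLI is installed as a Python package, the helper can be imported directly. A minimal sketch, assuming the function still lives at this module path (it may move between CLI versions) and noting that it expects the path without the s3:// prefix:
# Assumed module path; verify against your installed awscli version.
from awscli.customizations.s3.utils import find_bucket_key

bucket, key = find_bucket_key('bucket_name/folder1/folder2/file1.json')
print(bucket, key)
#> bucket_name folder1/folder2/file1.json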
Upvotes: 0
Reputation: 103
This can be done smoothly:
s3_uri = 's3://bucket_name/folder1/folder2/file1.json'
bucket_name, key = s3_uri[5:].split('/', 1)
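The slice assumes the URI always starts with the five-character s3:// prefix; a minimal guard (my own addition) could be:
# Fail loudly on anything that isn't an s3:// URI before slicing:
if not s3_uri.startswith('s3://'):
    raise ValueError('Not an S3 URI: ' + s3_uri)
bucket_name, key = s3_uri[5:].split('/', 1)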
Upvotes: 8
Reputation: 19
The simplest thing I do is:
s = 's3://bucket/path1/path2/file.txt'
s1 = s.split('/', 3)
bucket = s1[2]
object_key = s1[3]
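To see why indexes 2 and 3 hold the bucket and key, look at the full split result:
print(s.split('/', 3))
#> ['s3:', '', 'bucket', 'path1/path2/file.txt']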
Upvotes: 1
Reputation: 2553
A more recent option is to use cloudpathlib, which implements pathlib functions for files on cloud services (including S3, Google Cloud Storage, and Azure Blob Storage). In addition to those functions, it makes it easy to get the bucket and the key for your S3 paths.
from cloudpathlib import S3Path
path = S3Path("s3://bucket_name/folder1/folder2/file1.json")
path.bucket
#> 'bucket_name'
path.key
#> 'folder1/folder2/file1.json'
Upvotes: 13
Reputation: 89
I use the following regex:
^(?:[sS]3:\/\/)?([a-zA-Z0-9\._-]+)(?:\/)(.+)$
If it matches, group 1 is the bucket and group 2 is the key. This pattern handles a bucket path with or without the s3:// URI prefix. If you want to allow other legal bucket-name characters, modify the [a-zA-Z0-9\._-] part of the pattern to include other chars as needed.
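A quick way to try the pattern in Python first (this snippet is my own addition):
import re

S3_URI_PATTERN = r'^(?:[sS]3:\/\/)?([a-zA-Z0-9\._-]+)(?:\/)(.+)$'

m = re.match(S3_URI_PATTERN, 's3://my-bucket/folder/file.json')
print(m.group(1), m.group(2))
#> my-bucket folder/file.json

# The prefix is optional, so a bare bucket/key path matches too:
m = re.match(S3_URI_PATTERN, 'my-bucket/folder/file.json')
print(m.group(1), m.group(2))
#> my-bucket folder/file.json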
Complete JS example (in TypeScript form):
const S3_URI_PATTERN = '^(?:[sS]3:\\/\\/)?([a-zA-Z0-9\\._-]+)(?:\\/)(.+)$';

export interface S3UriParseResult {
  bucket: string;
  name: string;
}

export class S3Helper {
  /**
   * Parses an S3 URI into its bucket and key (name) parts.
   * @param uri the S3 object URI, with or without the s3:// prefix
   */
  static parseUri(uri: string): S3UriParseResult {
    const re = new RegExp(S3_URI_PATTERN);
    const match = re.exec(uri);
    if (!match || match.length !== 3) {
      throw new Error('Invalid S3 object URI');
    }
    return {
      bucket: match[1],
      name: match[2],
    };
  }
}
Upvotes: 0
Reputation: 1607
Pretty easy to accomplish with a single line of built-in string methods...
s3_filepath = "s3://bucket-name/and/some/key.txt"
bucket, key = s3_filepath.replace("s3://", "").split("/", 1)
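One caveat: str.replace removes every occurrence of "s3://", so a key that happens to contain that substring would be mangled. On Python 3.9+, str.removeprefix strips only the leading prefix; a variant sketch:
s3_filepath = "s3://bucket-name/and/some/key.txt"
# removeprefix only touches the start of the string (Python 3.9+):
bucket, key = s3_filepath.removeprefix("s3://").split("/", 1)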
Upvotes: 27
Reputation: 161
This is a nice project: s3path is a pathlib extension for the AWS S3 service.
>>> from s3path import S3Path
>>> path = S3Path.from_uri('s3://bucket_name/folder1/folder2/file1.json')
>>> print(path.bucket)
'/bucket_name'
>>> print(path.key)
'folder1/folder2/file1.json'
>>> print(list(path.key.parents))
[S3Path('folder1/folder2'), S3Path('folder1'), S3Path('.')]
Upvotes: 7
Reputation: 137
Here it is as a one-liner using regex:
import re
s3_path = "s3://bucket/path/to/key"
bucket, key = re.match(r"s3:\/\/(.+?)\/(.+)", s3_path).groups()
Upvotes: 4
Reputation: 34744
Since it's just a normal URL, you can use urlparse to get all the parts of the URL.
>>> from urlparse import urlparse
>>> o = urlparse('s3://bucket_name/folder1/folder2/file1.json', allow_fragments=False)
>>> o
ParseResult(scheme='s3', netloc='bucket_name', path='/folder1/folder2/file1.json', params='', query='', fragment='')
>>> o.netloc
'bucket_name'
>>> o.path
'/folder1/folder2/file1.json'
You may have to remove the leading slash from the key, as another answer suggests:
o.path.lstrip('/')
With Python 3, urlparse moved to urllib.parse, so use:
from urllib.parse import urlparse
Here's a class that takes care of all the details.
try:
    from urlparse import urlparse
except ImportError:
    from urllib.parse import urlparse


class S3Url(object):
    """
    >>> s = S3Url("s3://bucket/hello/world")
    >>> s.bucket
    'bucket'
    >>> s.key
    'hello/world'
    >>> s.url
    's3://bucket/hello/world'

    >>> s = S3Url("s3://bucket/hello/world?qwe1=3#ddd")
    >>> s.bucket
    'bucket'
    >>> s.key
    'hello/world?qwe1=3#ddd'
    >>> s.url
    's3://bucket/hello/world?qwe1=3#ddd'

    >>> s = S3Url("s3://bucket/hello/world#foo?bar=2")
    >>> s.key
    'hello/world#foo?bar=2'
    >>> s.url
    's3://bucket/hello/world#foo?bar=2'
    """

    def __init__(self, url):
        self._parsed = urlparse(url, allow_fragments=False)

    @property
    def bucket(self):
        return self._parsed.netloc

    @property
    def key(self):
        if self._parsed.query:
            return self._parsed.path.lstrip('/') + '?' + self._parsed.query
        else:
            return self._parsed.path.lstrip('/')

    @property
    def url(self):
        return self._parsed.geturl()
Upvotes: 190
Reputation: 1020
For those who, like me, were trying to use urlparse to extract the key and bucket in order to create an object with boto3, there's one important detail: remove the slash from the beginning of the key.
import boto3
from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse

o = urlparse('s3://bucket_name/folder1/folder2/file1.json')
bucket = o.netloc
key = o.path
client = boto3.client('s3')
client.put_object(Body='test', Bucket=bucket, Key=key.lstrip('/'))
It took me a while to realize this, because boto3 doesn't throw any exception.
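To make the pitfall concrete: an S3 key that starts with '/' is a different key, so without the lstrip the object lands under a phantom empty-named folder:
print(o.path)              #> /folder1/folder2/file1.json
print(o.path.lstrip('/'))  #> folder1/folder2/file1.json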
Upvotes: 36
Reputation: 547
A solution that works without urllib or re (it also handles the preceding slash):
def split_s3_path(s3_path):
    path_parts = s3_path.replace("s3://", "").split("/")
    bucket = path_parts.pop(0)
    key = "/".join(path_parts)
    return bucket, key
To run:
bucket, key = split_s3_path("s3://my-bucket/some_folder/another_folder/my_file.txt")
Returns:
bucket: my-bucket
key: some_folder/another_folder/my_file.txt
Upvotes: 47
Reputation: 824
If you want to do it with regular expressions, you can do the following:
>>> import re
>>> uri = 's3://my-bucket/my-folder/my-object.png'
>>> match = re.match(r's3:\/\/(.+?)\/(.+)', uri)
>>> match.group(1)
'my-bucket'
>>> match.group(2)
'my-folder/my-object.png'
This has the advantage that you can check for the s3 scheme rather than allowing anything there.
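For example, a non-S3 URL simply fails to match (using the same pattern):
>>> re.match(r's3:\/\/(.+?)\/(.+)', 'https://my-bucket/my-folder/my-object.png') is None
True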
Upvotes: 9