jm3

Reputation: 2161

S3: How to do a partial read / seek without downloading the complete file?

Although they resemble files, objects in Amazon S3 aren't really "files", just like S3 buckets aren't really directories. On a Unix system I can use head to preview the first few lines of a file, no matter how large it is, but I can't do this with an object in S3. So how do I do a partial read on S3?

Upvotes: 61

Views: 60889

Answers (5)

Hafiz Muhammad Shafiq

Reputation: 8670

If you are using the AWS CLI (a bash command), you can download the first 10,000 bytes of an object with get-object and --range. Note that the end of the range is inclusive, so bytes=0-9999 fetches 10,000 bytes. For example, from the public dataset named Common Crawl:

aws s3api get-object --bucket commoncrawl --key cc-index/collections/CC-MAIN-2024-26/indexes/cdx-00299.gz --range bytes=0-9999 output.gz

Upvotes: 0

lambda

Reputation: 111

The get_object API accepts a Range argument for partial reads:

import boto3

s3 = boto3.client('s3')
# Range ends are inclusive, so this fetches stop_byte - start_byte bytes
resp = s3.get_object(Bucket=bucket, Key=key, Range='bytes={}-{}'.format(start_byte, stop_byte - 1))
res = resp['Body'].read()
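Because the end of an HTTP Range is inclusive, off-by-one errors are easy when reading a large object in pieces. As a sketch (the helper name and parameters are illustrative, not part of boto3), a generator can produce the successive Range values for streaming an object in fixed-size chunks:

```python
def chunk_ranges(total_size, chunk_size):
    """Yield inclusive HTTP Range header values covering an object
    of total_size bytes in chunk_size pieces."""
    start = 0
    while start < total_size:
        # Range ends are inclusive, hence the -1
        end = min(start + chunk_size, total_size) - 1
        yield 'bytes={}-{}'.format(start, end)
        start = end + 1

# Each value could then be passed as the Range argument:
#   for r in chunk_ranges(size, 8 * 1024 * 1024):
#       part = s3.get_object(Bucket=bucket, Key=key, Range=r)['Body'].read()
```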

Upvotes: 9

Alex B

Reputation: 2375

Using Python (the legacy boto library) you can preview the first records of a compressed file.

Connect using boto:

import csv
import io
from gzip import GzipFile

import boto
from boto.s3.key import Key

# Connect
s3 = boto.connect_s3()
bucket = s3.get_bucket('my_bucket', validate=False)

Read the first 20 lines from the gzip-compressed file:

# Read first 20 records
limit = 20
k = Key(bucket)
k.key = 'my_file.gz'
k.open()
gzipped = GzipFile(None, 'rb', fileobj=k)
reader = csv.reader(io.TextIOWrapper(gzipped, newline="", encoding="utf-8"), delimiter='^')
for i, line in enumerate(reader):
    if i >= limit:
        break
    print(i, line)

So it's equivalent to the following Unix command:

zcat my_file.gz | head -20

Upvotes: 3

Rick W

Reputation: 61

The AWS .NET SDK appears to allow only fixed-ended ranges (RE: public ByteRange(long start, long end)). What if I want to start in the middle and read to the end? An HTTP range of Range: bytes=1000- is perfectly acceptable for "start at 1000 and read to the end", but I do not believe they have allowed for this in the .NET library.
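At the HTTP level both open-ended and suffix ranges are valid, and boto3's Range parameter is just the raw header value, so the limitation above appears specific to that .NET helper class. A sketch (the helper names are mine, not an AWS API):

```python
def open_ended_range(start):
    """Range value for 'start at byte `start` and read to the end'."""
    return 'bytes={}-'.format(start)

def suffix_range(length):
    """Range value for 'the last `length` bytes of the object'."""
    return 'bytes=-{}'.format(length)

# With boto3 these can be passed directly, e.g.:
# s3.get_object(Bucket=bucket, Key=key, Range=open_ended_range(1000))
```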

Upvotes: 6

jm3

Reputation: 2161

S3 files can be huge, but you don't have to fetch the entire thing just to read the first few bytes. The S3 APIs support the HTTP Range: header (see RFC 2616, now superseded by RFC 9110), which takes a byte range argument.

Just add a Range: bytes=0-NN header to your S3 request, where NN is the requested number of bytes to read, and you'll fetch only those bytes rather than the whole file. Now you can preview that 900 GB CSV file you left in an S3 bucket without waiting for the entire thing to download. Read the full GET Object documentation on Amazon's developer docs.
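As a bonus, a ranged GET returns 206 Partial Content with a Content-Range header such as bytes 0-99/1048576, where the figure after the slash is the total object size, so a partial read also tells you how large the object is without a separate HEAD request. A minimal parsing sketch (the helper name is illustrative):

```python
def total_size_from_content_range(content_range):
    """Parse the total object size out of a Content-Range header value,
    e.g. 'bytes 0-99/1048576' -> 1048576."""
    return int(content_range.rsplit('/', 1)[1])

# resp = s3.get_object(Bucket=bucket, Key=key, Range='bytes=0-99')
# size = total_size_from_content_range(resp['ContentRange'])
```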

Upvotes: 99
