Reputation: 2161
Although they resemble files, objects in Amazon S3 aren't really "files", just as S3 buckets aren't really directories. On a Unix system I can use head to preview the first few lines of a file, no matter how large it is, but I can't do this on S3. So how do I do a partial read on S3?
Upvotes: 61
Views: 60889
Reputation: 8670
This example uses the AWS CLI (a shell command). To download the first 10,000 bytes (bytes 0 through 9999) of a file from the public Common Crawl dataset:
aws s3api get-object --bucket commoncrawl --key cc-index/collections/CC-MAIN-2024-26/indexes/cdx-00299.gz --range bytes=0-9999 output.gz
Upvotes: 0
Reputation: 111
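The get_object API has a Range argument for partial reads:
import boto3
s3 = boto3.client('s3')
# HTTP ranges are inclusive on both ends, so stop_byte - 1 is the last byte returned.
resp = s3.get_object(Bucket=bucket, Key=key, Range='bytes={}-{}'.format(start_byte, stop_byte-1))
res = resp['Body'].read()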
Upvotes: 9
Reputation: 2375
Using Python, you can preview the first records of a compressed file.
Connect using boto (the legacy boto 2 library).
#Connect:
import boto
from boto.s3.key import Key
s3 = boto.connect_s3()
bname='my_bucket'
bucket = s3.get_bucket(bname, validate=False)
Read the first 20 lines from a gzip-compressed file:
#Read first 20 records
import csv
import io
from gzip import GzipFile
limit=20
k = Key(bucket)
k.key = 'my_file.gz'
k.open()
gzipped = GzipFile(None, 'rb', fileobj=k)
reader = csv.reader(io.TextIOWrapper(gzipped, newline="", encoding="utf-8"), delimiter='^')
for i, line in enumerate(reader):
    if i >= int(limit): break
    print(i, line)
This is equivalent to the following Unix command:
zcat my_file.gz|head -20
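If the compressed bytes come from a ranged GET rather than a streamed key, the truncated gzip prefix can still be decoded with zlib's streaming decompressor. The following is a minimal sketch, not the approach used above: the function name head_lines_from_gzip_prefix is hypothetical, and it assumes a single-member gzip file containing newline-delimited text.

```python
import gzip
import zlib
from typing import List


def head_lines_from_gzip_prefix(prefix: bytes, limit: int = 20,
                                encoding: str = "utf-8") -> List[str]:
    """Return up to `limit` complete lines decoded from the start of a gzip stream.

    `prefix` is assumed to be the first N bytes of a gzip file, for example
    the body of a ranged GET. The stream may be cut off mid-block.
    """
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # A streaming decompressor returns whatever it could decode so far
    # instead of failing on the missing end of the stream.
    data = decompressor.decompress(prefix)
    # A multi-byte character may be cut in half at the end, so ignore errors.
    text = data.decode(encoding, errors="ignore")
    lines = text.split("\n")
    # The final element is either empty or a partial line, so drop it.
    return lines[:-1][:limit]


if __name__ == "__main__":
    # Simulate a ranged read by truncating a locally compressed payload.
    payload = "".join("row-{}\n".format(i) for i in range(5000)).encode("utf-8")
    compressed = gzip.compress(payload)
    for line in head_lines_from_gzip_prefix(compressed[: len(compressed) // 2], limit=5):
        print(line)
```

How many bytes to request depends on the compression ratio and line length, so the prefix size is something to tune; if too few lines come back, request a larger range.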
Upvotes: 3
Reputation: 61
The AWS .NET SDK appears to support only fixed-ended ranges (see public ByteRange(long start, long end)). What if I want to start in the middle and read to the end? An HTTP range of Range: bytes=1000- is perfectly acceptable for "start at 1000 and read to the end". I do not believe the .NET library allows for this.
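For comparison, boto3 in Python takes the Range value as a raw header string, so an open-ended range can be written directly. This is a small sketch: the helper names range_from and range_last are hypothetical, and the commented-out call assumes a bucket and key of your own.

```python
def range_from(start: int) -> str:
    """Range header value meaning 'from byte `start` to the end of the object'."""
    if start < 0:
        raise ValueError("start must be non-negative")
    return "bytes={}-".format(start)


def range_last(count: int) -> str:
    """Range header value meaning 'the last `count` bytes of the object'."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return "bytes=-{}".format(count)


if __name__ == "__main__":
    print(range_from(1000))
    print(range_last(500))
    # Usage with boto3 (placeholder bucket and key):
    # import boto3
    # s3 = boto3.client("s3")
    # resp = s3.get_object(Bucket="my_bucket", Key="my_file", Range=range_from(1000))
    # tail = resp["Body"].read()
```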
Upvotes: 6
Reputation: 2161
S3 objects can be huge, but you don't have to fetch the entire thing just to read the first few bytes. The S3 APIs support the HTTP Range: header (see RFC 2616), which takes a byte range argument.
Just add a Range: bytes=0-NN header to your S3 request, where NN is the offset of the last byte you want (the range is inclusive, so this returns NN+1 bytes), and you'll fetch only those bytes rather than the whole file. Now you can preview that 900 GB CSV file you left in an S3 bucket without waiting for the entire thing to download. For details, see the GET Object documentation in Amazon's developer docs.
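As a rough illustration of the header described above, here is a minimal Python sketch using boto3. The helper names head_range and s3_head are hypothetical, and the default of 1024 bytes is an arbitrary assumption.

```python
def head_range(num_bytes: int) -> str:
    """Build a Range header value covering the first `num_bytes` bytes."""
    if num_bytes < 1:
        raise ValueError("num_bytes must be at least 1")
    # Byte ranges are inclusive on both ends, so the last offset is num_bytes - 1.
    return "bytes=0-{}".format(num_bytes - 1)


def s3_head(bucket: str, key: str, num_bytes: int = 1024) -> bytes:
    """Fetch only the first `num_bytes` bytes of an S3 object."""
    # Imported here so head_range stays usable without boto3 installed.
    import boto3

    s3 = boto3.client("s3")
    resp = s3.get_object(Bucket=bucket, Key=key, Range=head_range(num_bytes))
    return resp["Body"].read()


if __name__ == "__main__":
    print(head_range(1000))
```

Calling s3_head("my_bucket", "my_file.csv", 4096) would return at most the first 4096 bytes, which can then be split on newlines to approximate head.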
Upvotes: 99