zkta

Reputation: 105

Read only the header of a CSV file from a GCS bucket in Python

I want to extract the header from a CSV file stored in Google Cloud Storage (GCP). I can extract the header, but the CSV file is larger than 20 GB.

I used the gcsfs library. It extracts the header, but it uses too much memory.

import re

import gcsfs

fs = gcsfs.GCSFileSystem(project=PROJECT)
with fs.open(f'{bucket}/{file}', 'rb') as f:
    # Reads and decodes the entire file into memory
    schema = f.read().decode("utf-8")
    # Remove everything after the first newline
    schema = re.sub("(\\n).*", "", schema)

I also tried this command, but it returns nothing:

fs.read_block('gs://my-bucket/my-file.txt', offset=1000, length=10, delimiter=b'\n')

My question is: how can I read only the header instead of the whole file?

Upvotes: 0

Views: 870

Answers (1)

mkrieger1

Reputation: 23218

schema = f.read()

This reads the whole file. Presumably, if the file object returned by gcsfs.GCSFileSystem.open works like the one returned by the built-in open, its read method should accept an integer argument that specifies the number of bytes to read.

For example, if the header is 100 bytes in size, try:

schema = f.read(100)

Or, if the header is the first line in the file, terminated by a \n character, try:

schema = f.readline()
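
Putting the readline suggestion together, here is a minimal sketch, not a definitive implementation. The helper names read_csv_header and read_gcs_csv_header are hypothetical, and the sketch assumes the header is a single UTF-8 line with no newline inside a quoted column name. The in-memory file at the end only stands in for the GCS object so the parsing step can be tried without a bucket.

```python
import csv
import io


def read_csv_header(binary_file, encoding="utf-8"):
    """Return the column names from the first line of an open binary file.

    Hypothetical helper: assumes the header occupies exactly one line.
    """
    # readline() stops at the first newline instead of consuming the file
    first_line = binary_file.readline().decode(encoding)
    # csv.reader copes with quoted column names that contain commas
    return next(csv.reader([first_line]), [])


def read_gcs_csv_header(project, bucket, file):
    """Hypothetical wrapper that opens the object with gcsfs, as in the question."""
    import gcsfs  # imported lazily so read_csv_header works without gcsfs installed

    fs = gcsfs.GCSFileSystem(project=project)
    with fs.open(f"{bucket}/{file}", "rb") as f:
        return read_csv_header(f)


# Local demonstration: an in-memory file stands in for the GCS object
sample = io.BytesIO(b'id,"last, first",age\r\n1,"Doe, Jane",30\n')
header = read_csv_header(sample)
print(header)
```

Since gcsfs file objects are buffered, readline should only need to fetch the first block of the object rather than all 20 GB, though the exact amount downloaded depends on the block size in use.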

Upvotes: 1
