Reputation: 105
I want to extract the header from a CSV file stored in GCP Cloud Storage. I can extract the header, but the CSV file is more than 20 GB, so reading the whole thing is a problem.
I used a library. It extracts the header, but it takes too much memory:
import re
import gcsfs

fs = gcsfs.GCSFileSystem(project=PROJECT)
with fs.open(f'{bucket}/{file}', 'rb') as f:
    # Reads the entire file into memory
    schema = f.read().decode("utf-8")
    # Remove everything after the first newline
    schema = re.sub("(\\n).*", "", schema)
I also tried this command, but it returns nothing:
fs.read_block('gs://my-bucket/my-file.txt', offset=1000, length=10, delimiter=b'\n')
My question is: how can I read only the header, not the whole file?
Upvotes: 0
Views: 870
Reputation: 23218
schema = f.read()
This reads the whole file. Presumably, if gcsfs.GCSFileSystem.open works like the built-in open, the read method of the file it returns accepts an integer argument specifying the number of bytes to read.
For example, if the header is 100 bytes in size, try:
schema = f.read(100)
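Since the exact header size is rarely known in advance, a variation (a minimal sketch; the 4096-byte chunk size is an assumed upper bound on the header length, not anything from the question) is to read one fixed-size chunk and keep only the text before the first newline:

import gcsfs

fs = gcsfs.GCSFileSystem(project=PROJECT)
with fs.open(f'{bucket}/{file}', 'rb') as f:
    # Read only the first 4 KiB instead of the whole 20 GB file
    chunk = f.read(4096)
# Keep everything before the first newline
schema = chunk.decode("utf-8").split("\n", 1)[0]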
Or, if the header is the first line in the file, terminated by a \n character, try:
schema = f.readline()
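Putting it together as a minimal sketch (assuming gcsfs file objects support readline, as fsspec file-like objects generally do; PROJECT, bucket, and file are the placeholders from the question):

import csv
import gcsfs

fs = gcsfs.GCSFileSystem(project=PROJECT)
with fs.open(f'{bucket}/{file}', 'rb') as f:
    # readline stops at the first newline, so the rest of the file is never read
    header_line = f.readline().decode("utf-8").rstrip("\r\n")

# Split the header line into column names
columns = next(csv.reader([header_line]))

As an aside, the read_block call from the question may also work if offset is set to 0: with a delimiter, fsspec extends the block to the next b'\n', so fs.read_block(f'{bucket}/{file}', 0, 1, delimiter=b'\n') should return just the first line (an untested assumption based on fsspec's documented behavior).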
Upvotes: 1