Nikita P

Reputation: 81

Google Storage // Cloud Function // Python Modify CSV file in the Bucket

thanks for reading.

I'm having a problem modifying a CSV file in a bucket. I know how to copy/rename/move a file, but I have no idea how to modify a file without downloading it to the local machine.

My general idea is to download the blob (CSV file) as bytes, modify it, and upload it back to the bucket as bytes. But I don't understand how to modify the bytes.

What I need to do to the CSV: add a new header, date, and add the value (today's date) to each row.

---INPUT--- CSV file in the bucket:

a,b
1,2

---OUTPUT--- updated CSV file in the bucket:

a,b,date
1,2,today

My code:

from datetime import date
from google.cloud import storage

storage_client = storage.Client()

def addDataToCsv(bucket, fileName):
    today = str(date.today())

    bucket = storage_client.get_bucket(bucket)
    blob = bucket.blob(fileName)
    fileNameText = blob.download_as_string()

    # This should be the magic bytes modification

    # 'path' is a prefix defined elsewhere in my code
    blobNew = bucket.blob(path + '/' + 'mod.csv')
    blobNew.upload_from_string(fileNameText, content_type='text/csv')


Please help. Thank you for your time and effort.

Upvotes: 0

Views: 1153

Answers (2)

Alex L

Reputation: 157

If I understand correctly, you want to modify the CSV file in the bucket without downloading it to the local machine's file system.

You cannot directly edit a file in a Cloud Storage bucket, aside from its metadata; therefore, you will need to download it to your machine in some form and push the changes back to the bucket.

Objects are immutable, which means that an uploaded object cannot change throughout its storage lifetime.

However, an approach would be to use Cloud Storage FUSE, which mounts a Cloud Storage bucket as a file system so you can edit any file from there and changes are applied to your bucket.
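
For instance (a minimal sketch, assuming the bucket has already been mounted with gcsfuse; the mount point and file name are hypothetical):

# Assumes the bucket was mounted beforehand, e.g.:
#   gcsfuse my-bucket /mnt/gcs
from datetime import date

today = str(date.today())

# Read the CSV through the mounted file system
with open('/mnt/gcs/mod.csv', 'r') as f:
    lines = f.read().splitlines()

# Append the new column to the header and today's date to each row
lines[0] += ',date'
lines[1:] = [line + ',' + today for line in lines[1:] if line]

# Write it back; the change is applied to the object in the bucket
with open('/mnt/gcs/mod.csv', 'w') as f:
    f.write('\n'.join(lines) + '\n')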

Still, if this is not a suitable solution for you, the bytes can be downloaded and modified as you propose, by decoding the bytes object (commonly with UTF-8, although it depends on your characters) and re-encoding it before uploading.

# Create an array of every CSV file line
csv_array = fileNameText.decode("utf-8").split("\n")
# Add the new column to the header
csv_array[0] = csv_array[0] + ",date\n"
# Add the date to each row, skipping empty lines
# (e.g. the trailing newline at the end of the file)
for i in range(1, len(csv_array)):
    if csv_array[i]:
        csv_array[i] = csv_array[i] + "," + today + "\n"
# Re-encode from list to bytes for the upload
fileNameText = ''.join(csv_array).encode("utf-8")
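
For completeness, here is how the snippet might slot into the function from the question (a sketch; it keeps the question's names and replaces the undefined path prefix with a plain destination name):

from datetime import date
from google.cloud import storage

storage_client = storage.Client()

def addDataToCsv(bucket, fileName):
    today = str(date.today())

    bucket = storage_client.get_bucket(bucket)
    blob = bucket.blob(fileName)
    fileNameText = blob.download_as_string()

    # Decode, add the date column, re-encode
    csv_array = fileNameText.decode("utf-8").split("\n")
    csv_array[0] = csv_array[0] + ",date\n"
    for i in range(1, len(csv_array)):
        if csv_array[i]:
            csv_array[i] = csv_array[i] + "," + today + "\n"
    fileNameText = ''.join(csv_array).encode("utf-8")

    # 'mod.csv' as the destination name follows the question's code
    blobNew = bucket.blob('mod.csv')
    blobNew.upload_from_string(fileNameText, content_type='text/csv')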

Take into account that if your local machine has serious storage or performance limitations, or if your CSV is large enough that handling it as above might cause problems, or just for reference, you could use the compose command. For this you would need to modify the code above so that only a section of the CSV file is edited and uploaded at a time, with the pieces then joined by gsutil compose in Cloud Storage.
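
The composition itself can also be done from Python (a rough sketch with hypothetical object names; Blob.compose in the google-cloud-storage client is the equivalent of gsutil compose):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")  # hypothetical bucket name

# Each part of the CSV was edited and uploaded separately
parts = [bucket.blob("parts/mod-0.csv"), bucket.blob("parts/mod-1.csv")]

# Concatenate the parts server-side into a single object,
# without downloading them again
bucket.blob("mod.csv").compose(parts)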

Upvotes: 1

MBHA Phoenix

Reputation: 2207

Sorry, I know I'm not in your shoes, but if I were you I would try to keep things simple. Indeed, most systems work best when they are kept simple, and they are easier to maintain and share (KISS principle). So, given that you are using your local machine, I assume you have generous network bandwidth and enough disk space and memory. I would not hesitate to download the file, modify it, and upload it again, even when dealing with big files.

Then, if you are willing to use another format for the file:

download blob (csv file) as bytes

In this case, a better solution for both size and code simplicity is to convert your file to the Parquet or Avro format. These formats will drastically reduce your file size, especially if you add compression. They also let you keep a structure for your data, which makes modifications much simpler. Finally, there are many resources online on how to use these formats with Python, and comparisons between CSV, Avro, and Parquet.
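
As a rough sketch of such a conversion (bucket and object names are hypothetical; pandas with the pyarrow engine is one common way to do it):

import io

import pandas as pd
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")  # hypothetical bucket name

# Read the CSV straight from the bucket into a DataFrame
csv_bytes = bucket.blob("data.csv").download_as_string()
df = pd.read_csv(io.BytesIO(csv_bytes))

# Write it back as compressed Parquet (requires pyarrow or fastparquet)
buf = io.BytesIO()
df.to_parquet(buf, compression="snappy")
bucket.blob("data.parquet").upload_from_string(
    buf.getvalue(), content_type="application/octet-stream"
)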

Upvotes: 0
