Reputation: 359
I'm new to GCS and Cloud Functions and would like to understand how I can do a lightweight ETL using these two technologies combined with Python (3.7).
I have a GCS bucket called 'Test_1233' containing 3 files (all structurally identical). When a new file is added to this GCS bucket, I would like the following Python code to run, produce an 'output.csv' file, and save it in the same bucket. The code I'm trying to run is below:
import pandas as pd
import glob
import os
import re
import numpy as np
path = os.getcwd()
files = os.listdir(path)  ## Originally this was intended for finding files in the local directory - I now need this adapted for finding files within GCS(!)
### Loading Files by Variable ###
df = pd.DataFrame()
data = pd.DataFrame()
for files in glob.glob('gs://test_1233/Test *.xlsx'):  ## attempts to find all relevant files within the GCS bucket
    data = pd.read_excel(files, 'Sheet1', skiprows=1).fillna(method='ffill')
    date = re.compile(r'([\.\d]+ - [\.\d]+)').search(files).groups()[0]
    data['Date'] = date
    data['Start_Date'], data['End_Date'] = data['Date'].str.split(' - ', 1).str
    data['End_Date'] = data['End_Date'].str[:10]
    data['Start_Date'] = data['Start_Date'].str[:10]
    data['Start_Date'] = pd.to_datetime(data['Start_Date'], format='%d.%m.%Y', errors='coerce')
    data['End_Date'] = pd.to_datetime(data['End_Date'], format='%d.%m.%Y', errors='coerce')
    df = df.append(data)

df['Product'] = np.where(df['Product'] == 'BR: Tpaste Adv Wht 2x120g', 'ToothpasteWht2x120g', df['Product'])

## Stores cleaned data back into the same GCS bucket as a 'csv' file
df.to_csv('Test_Output.csv')
As I'm totally new to this, I'm not sure how to create the correct path to read all the files within the cloud environment (I used to read files from my local directory!).
Any help would be most appreciated.
Upvotes: 1
Views: 290
Reputation: 21570
You'll need to download/upload the files from Google Cloud Storage to your Cloud Function environment first, using the google-cloud-storage module. See the module's documentation.
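A minimal sketch of that round trip, assuming the test_1233 bucket and 'Test' prefix from the question (the object names are assumed to be flat, with no '/' in them):

from google.cloud import storage

client = storage.Client()  # uses the function's default service account credentials
bucket = client.bucket('test_1233')  # bucket name taken from the question

# Download each matching Excel file into the writeable /tmp directory
local_paths = []
for blob in client.list_blobs('test_1233', prefix='Test'):
    local_path = '/tmp/' + blob.name  # assumes flat object names with no '/'
    blob.download_to_filename(local_path)
    local_paths.append(local_path)

# ... run the pandas cleaning code over local_paths here ...

# Upload the finished CSV back into the same bucket
bucket.blob('Test_Output.csv').upload_from_filename('/tmp/Test_Output.csv')

The glob pattern in your question would then become a loop over local_paths instead of a gs:// wildcard, which glob can't expand.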
Upvotes: 0
Reputation: 317758
If you want to download files from somewhere and (temporarily) write them to local files in the Cloud Functions runtime, be sure you read the documentation:
The only writeable part of the filesystem is the /tmp directory, which you can use to store temporary files in a function instance. This is a local disk mount point known as a "tmpfs" volume in which data written to the volume is stored in memory. Note that it will consume memory resources provisioned for the function.
The rest of the file system is read-only and accessible to the function.
Or, you can just read and work with the files directly in memory, since the file contents will consume memory either way.
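For example, a sketch of the in-memory route (again assuming the test_1233 bucket from the question; the object name here is hypothetical):

import io
import pandas as pd
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('test_1233')  # bucket name from the question

# Pull the object's bytes straight into memory and hand them to pandas
blob = bucket.blob('Test 01.01.2019 - 31.01.2019.xlsx')  # hypothetical object name
data = pd.read_excel(io.BytesIO(blob.download_as_bytes()), sheet_name='Sheet1', skiprows=1)

# Write the result back without touching the filesystem at all
bucket.blob('Test_Output.csv').upload_from_string(data.to_csv(), content_type='text/csv')

(On older versions of the google-cloud-storage library, download_as_bytes() is spelled download_as_string(); both return bytes.)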
Upvotes: 0