Hamza

Reputation: 6025

How to schedule importing data files from an SFTP server on a Compute Engine instance into BigQuery?

What I want to achieve:

Transfer the hourly data files that several different feeds deliver to an SFTP server on a Compute Engine VM into BigQuery, with near-real-time updates, effectively and cost-efficiently.

Context:

  1. The software I am trying to import data from is old legacy software and does not support direct exports to the cloud, so a direct connection from the software to the cloud isn't an option.

  2. It does, however, support exporting data to an SFTP server, which is not something any GCP tool exposes directly.

  3. So I have set up an SFTP server using vsftpd on a Compute Engine VM instance with expandable storage, given it a static IP, and hardwired that IP into my software. Data now arrives on the Compute Engine instance at hourly intervals seamlessly.

  4. Files are generated on an hourly basis, so there is a different file for each hour. However, they might contain some duplication, i.e. some of the last records of the previous hour's file may overlap with the beginning of the current hour's file.

  5. Files come from different source feeds and I have the feed names in the filenames, so the ever-growing data on my Compute Engine VM instance looks like this (a small filename-parsing sketch follows the listing):

    feed1_210301_0500.csv
    feed2_210301_0500.csv
    feed3_210301_0500.csv
    feed1_210301_0600.csv
    feed2_210301_0600.csv
    feed3_210301_0600.csv
    feed1_210301_0700.csv
    feed2_210301_0700.csv
    feed3_210301_0700.csv
    ...
    
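For what it's worth, the feed name and file timestamp can be recovered from that naming convention with a small piece of Python. This is only a sketch: the regex is an assumption based on the sample names above and presumes feed names contain no underscores, so it would need adjusting otherwise.

    # Sketch: parse "feedN_YYMMDD_HHMM.csv" into a feed name and a timestamp.
    # The pattern is inferred from the sample filenames and assumes feed names
    # contain no underscores.
    import re
    from datetime import datetime

    FILENAME_RE = re.compile(r"^(?P<feed>[^_]+)_(?P<date>\d{6})_(?P<time>\d{4})\.csv$")

    def parse_filename(name: str):
        """Return (feed, timestamp) for a file like 'feed1_210301_0500.csv'."""
        m = FILENAME_RE.match(name)
        if not m:
            raise ValueError(f"Unexpected filename: {name}")
        ts = datetime.strptime(m["date"] + m["time"], "%y%m%d%H%M")
        return m["feed"], ts

    print(parse_filename("feed1_210301_0500.csv"))
    # ('feed1', datetime.datetime(2021, 3, 1, 5, 0))
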

What I have tried:

  1. I have set BigQuery access & Cloud Storage permissions on the VM instance so it can push data from the VM into BigQuery:

(screenshot of the VM instance's Cloud API access scopes)

  2. I have tried importing data into BigQuery directly, and also staging it in Google Cloud Storage to load from there, but there is no option to import data from a VM instance into BigQuery directly, nor to move data from the VM to GCS and then load it into BigQuery, and the documentation is silent on scheduled transfers as well.

  3. There are external data transfer services like Fivetran and Hevo Data, but they are relatively expensive and seem like overkill, since both my source and destination are already on GCP; it wouldn't be much different from having a third VM and scheduling some import scripts. (Which, BTW, is my current workaround :D, i.e. using Python scripts to stream data into BigQuery as explained here; a minimal sketch of that appears after this list.)

  4. Currently I am exploring Data Fusion, which is only free for 120 hours each month, incurs extra costs for the underlying Dataproc pipelines, and I am not sure it is the right way to go. I am also looking at tools like Cloud Scheduler & Cloud Composer to see if either fits my data needs, but as of now I could not find a viable solution.
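For reference, here is a minimal sketch of the streaming workaround mentioned in point 3, using the google-cloud-bigquery client. The project, dataset and table names are placeholders, and streaming inserts are billed per GB while batch loads are free, which is part of why I am looking for a better approach.

    # Sketch of the current workaround: stream rows from a local CSV into
    # BigQuery. Project/dataset/table names are placeholders.
    import csv
    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.my_dataset.feed1_events"  # placeholder

    def stream_csv(path: str) -> None:
        """Read a local CSV and stream its rows into BigQuery."""
        with open(path, newline="") as f:
            # DictReader yields string values; convert types here if the
            # destination table expects integers, floats, timestamps, etc.
            rows = list(csv.DictReader(f))
        errors = client.insert_rows_json(table_id, rows)
        if errors:
            raise RuntimeError(f"Streaming insert failed: {errors}")

    stream_csv("feed1_210301_0500.csv")
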

I am happy to learn new tools and technologies, and any advice for improving the situation in any way is also appreciated.

Upvotes: 1

Views: 1140

Answers (2)

Ksign

Reputation: 817

You could use the GCS on-premises transfer (which you can schedule), and then schedule a transfer from GCS into BigQuery.

If neither this nor the external data transfer services work for you, then I believe your best bet is to create a script that runs a scheduled batch load of the data from your VM to BigQuery.
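A minimal sketch of such a scheduled batch load, assuming the Python BigQuery client is installed on the VM and cron runs the script hourly; the table name, directory paths and the CSV header assumption are placeholders:

    # Sketch: batch-load new CSVs into BigQuery, e.g. run hourly from cron:
    #   5 * * * * /usr/bin/python3 /opt/loader/load_new_files.py
    # Table name and directories are placeholders; /data/loaded must exist.
    from pathlib import Path
    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.my_dataset.feed_data"  # placeholder

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,          # assumes each file has a header row
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    for path in sorted(Path("/data/incoming").glob("*.csv")):
        with path.open("rb") as f:
            client.load_table_from_file(f, table_id, job_config=job_config).result()
        # Move the file aside so it is not loaded again on the next run.
        path.rename(Path("/data/loaded") / path.name)
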

Maybe this other answer might help you as well.

Upvotes: 0

jabbson

Reputation: 4913

I just tried uploading directly from the GCE VM and it worked flawlessly. I enabled BigQuery in the Cloud API access scopes, created a file (test_data.csv) with some random data that satisfies the schema of the table (test_table) I have in a BigQuery dataset (test_dataset), and ran:

    bq load test_dataset.test_table test_data.csv
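If your CSVs have a header row, flags like --source_format=CSV and --skip_leading_rows=1 can be added to the same command, and cron on the VM can run it on an hourly schedule.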

Upvotes: 0
