Alex

Reputation: 19883

Storage on Google Cloud

I have the following use case: a large volume of structured data that I need to analyze with PySpark. The data is currently in CSV format. I am trying to figure out the best way to store it on Google Cloud. My understanding is that HDFS won't work, because the HDFS data disappears every time the cluster shuts down, so I would have to reload the CSV data into HDFS each time, which is time consuming. It seems like the right strategy is to go with BigQuery, but I can't determine whether BigQuery storage is persistent.

Upvotes: 0

Views: 122

Answers (3)

Mosha Pasumansky

Reputation: 14014

If you plan to process your data only with PySpark, you will be better off storing the files in Google Cloud Storage rather than in BigQuery. Even Google Cloud's managed Spark offering (Dataproc) cannot read from BigQuery storage as efficiently as it can from Google Cloud Storage.
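
A minimal sketch of that pattern, assuming a Dataproc cluster and a placeholder bucket name (Dataproc clusters include the GCS connector, so gs:// paths can be read directly):

    from pyspark.sql import SparkSession

    # Read the CSV files straight from Cloud Storage; bucket and paths are placeholders.
    spark = SparkSession.builder.appName("csv-from-gcs").getOrCreate()
    df = spark.read.csv("gs://my-bucket/data/*.csv", header=True, inferSchema=True)
    df.printSchema()

    # Optionally convert to Parquet once, so repeated analysis avoids re-parsing CSV.
    df.write.parquet("gs://my-bucket/data-parquet/")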

Upvotes: 2

nlassaux

Reputation: 2406

Yes, BigQuery is persistent. What you should check is whether the SLA works for you; at the moment it guarantees >= 99.9% monthly data availability.

You can also store the data in a bucket in Google Cloud Storage. Pricing differs depending on how often you need to access the data: https://cloud.google.com/storage/

Google also provides guidance for choosing a storage option; take a look at this page of the documentation: https://cloud.google.com/storage-options/
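
As a rough sketch, copying local CSV files into a bucket could look like this with the google-cloud-storage Python client (the bucket and file names are placeholders):

    from google.cloud import storage

    # Upload a local CSV file to a Cloud Storage bucket; names are placeholders.
    client = storage.Client()
    bucket = client.bucket("my-bucket")
    blob = bucket.blob("data/my_file.csv")
    blob.upload_from_filename("my_file.csv")
    print(f"Uploaded to gs://my-bucket/{blob.name}")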

Upvotes: 2

Elliott Brossard

Reputation: 33765

Yes, BigQuery is persistent, though you can also control a table's expiration time. To load the CSV files into BigQuery, you can create a table from them by pointing at their location on GCS, assuming you have copied the files there. There are a variety of third-party connectors that can help get your data into GCS, and the BigQuery team provides a Data Transfer Service to help automate transferring your data.
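
For illustration, a load like that could be scripted with the google-cloud-bigquery Python client; the project, dataset, table, and bucket names below are placeholders:

    import datetime

    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.my_dataset.my_table"

    # Create a table directly from CSV files already sitting in GCS.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the header row
        autodetect=True,      # infer the schema from the CSV
    )
    load_job = client.load_table_from_uri(
        "gs://my-bucket/data/*.csv", table_id, job_config=job_config
    )
    load_job.result()  # wait for the load job to finish

    # Optionally set an expiration time, after which the table is deleted.
    table = client.get_table(table_id)
    table.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(days=30)
    client.update_table(table, ["expires"])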

Upvotes: 2
