Reputation: 19883
I have the following use case: a large volume of structured data that I will need to analyze with PySpark. The data is currently in CSV format. I am trying to figure out the best way to store it on Google Cloud. My understanding is that HDFS won't work, because the HDFS data disappears every time the cluster shuts down, so I would have to load the CSV data into HDFS again each time, which is time consuming. It seems like the right strategy is to go with BigQuery, but I can't determine whether BigQuery is persistent or not.
Upvotes: 0
Views: 122
Reputation: 14014
If you plan to process your data only with PySpark, you will be better off storing the files in Google Cloud Storage rather than in BigQuery. Even Google Cloud's managed Spark service (Dataproc) cannot read from BigQuery storage as efficiently as it can from Google Cloud Storage.
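For illustration, here is a minimal sketch of reading CSV files straight out of a GCS bucket with PySpark on Dataproc; the bucket name and path are placeholders, not anything from your setup:

```python
# Sketch: read CSV files from a GCS bucket with PySpark on Dataproc.
# Dataproc clusters ship with the GCS connector, so gs:// paths work directly.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-from-gcs").getOrCreate()

# "your-bucket" and the path are assumed placeholders.
df = spark.read.csv(
    "gs://your-bucket/path/to/data/*.csv",
    header=True,
    inferSchema=True,
)

df.printSchema()
df.show(5)
```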
Upvotes: 2
Reputation: 2406
Yes, BigQuery is persistent; what you should then check is whether the SLA is good enough for you. The SLA, for now, is >= 99.9% monthly data availability.
You can also store the data in a bucket in Google Cloud Storage. Pricing differs depending on how often you need to access that data: https://cloud.google.com/storage/
Google also helps you choose a storage option; take a look at this page of their documentation: https://cloud.google.com/storage-options/
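If you go the bucket route, uploading your CSVs is straightforward with the google-cloud-storage client; this is just a sketch, and the bucket, object path, and local file name are all assumed placeholders:

```python
# Sketch: copy a local CSV into a GCS bucket with the google-cloud-storage client.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("your-bucket")        # assumed bucket name
blob = bucket.blob("data/records.csv")       # assumed destination object path
blob.upload_from_filename("records.csv")     # assumed local file
```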
Upvotes: 2
Reputation: 33765
Yes, BigQuery is persistent, though you can also control the table expiration time. To load the CSV files into BigQuery, you can create a table from them by pointing to their location on GCS, assuming that you have copied the files there. There are a variety of third-party connectors that can help with getting your data to GCS, and there is a Data Transfer Service provided by the BigQuery team to help automate transferring your data.
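As a rough sketch of that load step, assuming the files are already in GCS, you could use the google-cloud-bigquery client like this; the project, dataset, table, and URI below are placeholders:

```python
# Sketch: load CSV files from GCS into a BigQuery table.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table"  # assumed identifiers

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the CSV header row
    autodetect=True,       # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(
    "gs://your-bucket/path/to/data/*.csv",  # assumed GCS location
    table_id,
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete
```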
Upvotes: 2