Reputation: 1979
I am trying to migrate our organization's Hadoop jobs to GCP, and I am unsure whether to use Cloud Dataflow or Cloud Dataproc.
I want to reuse the Hadoop jobs we have already created and minimize cluster management as much as possible. We also want to be able to persist data beyond the life of the cluster.
Can anyone suggest which service fits better?
Upvotes: 0
Views: 986
Reputation: 3217
I would start with Dataproc, since it is very close to what you already have.
Check out Dataproc initialization actions (https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions), create a simple cluster, and get a feel for it.
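To make "create a simple cluster" a bit more concrete, here is a minimal sketch that only builds the `gcloud dataproc clusters create` command line with an initialization action attached. The cluster name, region, and script URI are hypothetical placeholders, and the helper name `build_create_cluster_command` is mine, not part of any Google library. It assumes the gcloud CLI is installed if you actually want to run the printed command.

```python
import shlex


def build_create_cluster_command(cluster_name, region, init_action_uris):
    """Build the argv list for `gcloud dataproc clusters create`.

    init_action_uris is a list of gs:// URIs of scripts that Dataproc
    runs on each node at cluster creation time (may be empty).
    """
    cmd = [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
    ]
    if init_action_uris:
        # Multiple initialization actions are passed as a comma-separated list.
        cmd.append("--initialization-actions=" + ",".join(init_action_uris))
    return cmd


if __name__ == "__main__":
    # Hypothetical names: replace with your own cluster, region, and script.
    cmd = build_create_cluster_command(
        "migration-test",
        "us-central1",
        ["gs://my-bucket/init/install-deps.sh"],
    )
    # Print the command instead of running it, so nothing is created by accident.
    print(shlex.join(cmd))
```

Once the printed command looks right, you can paste it into a shell or pass the list to `subprocess.run`.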
Dataflow is fully managed, so you don't operate any cluster resources. However, you cannot migrate an on-premises cluster to Dataflow as is; you need to migrate (and sometimes rewrite) your Hive/Pig/Oozie etc. workloads.
Cost for Dataflow is also calculated differently: unlike Dataproc, there is no upfront cost, but every time you run a job on Dataflow you incur some cost for that run.
Upvotes: 1
Reputation: 1651
The choice between Cloud Dataproc and Cloud Dataflow depends largely on the nature of your Hadoop jobs and what they do. Cloud Dataproc is a managed big data platform oriented toward Hadoop/Spark; Cloud Dataflow is a managed big data platform oriented toward Apache Beam, for streaming use cases. You can use either or both.
To persist data beyond the life of the cluster, consider storing it on GCS or on persistent disks (PDs), depending on what your use case needs.
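As a rough illustration of the GCS option: Dataproc clusters come with the Cloud Storage connector, so an existing Hadoop job can usually take `gs://` paths where it used to take `hdfs://` paths. The sketch below rewrites HDFS URIs to GCS URIs and builds a `gcloud dataproc jobs submit hadoop` command line. The bucket, cluster, and jar names are hypothetical, both helper functions are my own, and it assumes your data has already been copied to the bucket under the same directory layout.

```python
from urllib.parse import urlparse


def to_gcs_uri(hdfs_uri, bucket):
    """Map hdfs://host:port/path (or hdfs:///path) to gs://bucket/path."""
    parsed = urlparse(hdfs_uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"not an hdfs:// URI: {hdfs_uri}")
    # Keep the directory layout, drop the NameNode host and port.
    return f"gs://{bucket}{parsed.path}"


def build_submit_hadoop_job_command(cluster, region, jar_uri, job_args):
    """Build the argv list for `gcloud dataproc jobs submit hadoop`."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "hadoop",
        f"--cluster={cluster}",
        f"--region={region}",
        f"--jar={jar_uri}",
        "--",  # everything after this is passed to the job itself
        *job_args,
    ]


if __name__ == "__main__":
    # Hypothetical paths and names, for illustration only.
    bucket = "my-bucket"
    job_args = [
        to_gcs_uri("hdfs://namenode:8020/data/input", bucket),
        to_gcs_uri("hdfs://namenode:8020/data/output", bucket),
    ]
    cmd = build_submit_hadoop_job_command(
        "migration-test", "us-central1", "gs://my-bucket/jobs/wordcount.jar", job_args
    )
    print(" ".join(cmd))
```

Because input and output live in the bucket rather than on cluster HDFS, the cluster can be deleted after the job finishes without losing data.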
Upvotes: 1