Ryan Yuan

Reputation: 2556

Dataprep vs Dataflow vs Dataproc

To perform source data preparation, data transformation or data cleansing, in what scenario should we use Dataprep vs Dataflow vs Dataproc?

Upvotes: 10

Views: 16899

Answers (3)

honeytechiebee

Reputation: 162

An important note about Dataprep: it provides data cleaning and automatically identifies anomalies in the data. It is integrated with Cloud Storage, Bigtable and BigQuery.

Upvotes: -3

ama

Reputation: 53

Both Dataproc and Dataflow are data processing services on Google Cloud. Both can process batch or streaming data, and both offer templates that make common workflows easier to run. The features below distinguish the two.

Dataproc is designed to run on clusters, which makes it compatible with Apache Hadoop, Hive and Spark. It creates clusters quickly and can autoscale them without interrupting running jobs.

Dataflow is the better choice if your processing has no dependency on Spark or Hadoop. You do not manage a cluster yourself; instead, the service is built around parallel data processing: the data is split and processed on multiple workers at the same time to reduce processing time.
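To make the "split the data and process the pieces in parallel" idea concrete, here is a minimal sketch using only Python's standard library. It is not the Dataflow or Apache Beam API; the function names (clean_record, clean_chunk, split, parallel_clean) and the toy cleansing rule are assumptions made up for illustration. A managed service does the same kind of thing at a much larger scale and handles the worker management for you.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def clean_record(record: str) -> str:
    """Toy cleansing step (hypothetical): trim whitespace and lower-case."""
    return record.strip().lower()


def clean_chunk(chunk: List[str]) -> List[str]:
    """Clean one chunk, dropping records that are empty after trimming."""
    return [clean_record(r) for r in chunk if r.strip()]


def split(records: List[str], n_chunks: int) -> List[List[str]]:
    """Split records into contiguous chunks so the original order is kept."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def parallel_clean(records: List[str], n_workers: int = 4) -> List[str]:
    """Clean all records by handing each chunk to a separate worker thread."""
    chunks = split(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        cleaned_chunks = pool.map(clean_chunk, chunks)
        return [r for chunk in cleaned_chunks for r in chunk]


if __name__ == "__main__":
    print(parallel_clean(["  Alice ", "", "BOB", "  ", "Carol  "], n_workers=2))
```

The point of the sketch is only the shape of the computation: each chunk is cleaned independently, so adding workers shortens the wall-clock time without changing the result.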

Upvotes: 3

Lefteris S

Reputation: 1672

Data preparation/transformation/cleaning tasks can all be seen as ETL processes, implementable with any of the products you mention. This older answer covers the basics of the Dataflow vs Dataproc question and includes this link which summarises what you should keep in mind when choosing between these three.

In brief, you should consider familiarity (have you already worked with Hadoop-ecosystem tools? With the Beam programming model? Would you rather work via a UI?) and the desired level of control (Dataproc allows more control over the cluster, while Dataflow and Dataprep are fully managed services).
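As a rough illustration, the two criteria above (familiarity and level of control) could be written down as a small rule of thumb. The Needs class, the suggest_service function and, in particular, the order in which the rules are checked are assumptions of this sketch, not an official decision procedure.

```python
from dataclasses import dataclass


@dataclass
class Needs:
    """Hypothetical description of a team's situation."""
    prefers_ui: bool = False
    knows_hadoop_ecosystem: bool = False
    knows_beam: bool = False
    needs_cluster_control: bool = False


def suggest_service(needs: Needs) -> str:
    """Return a suggested product name based on the heuristics above."""
    # Control over the cluster is only available with Dataproc.
    if needs.needs_cluster_control:
        return "Dataproc"
    # A UI-driven workflow points to Dataprep.
    if needs.prefers_ui:
        return "Dataprep"
    # Existing Hadoop/Spark skills (and no Beam experience) favour Dataproc.
    if needs.knows_hadoop_ecosystem and not needs.knows_beam:
        return "Dataproc"
    # Otherwise default to the fully managed, Beam-based option.
    return "Dataflow"


if __name__ == "__main__":
    print(suggest_service(Needs(prefers_ui=True)))
    print(suggest_service(Needs(knows_hadoop_ecosystem=True)))
    print(suggest_service(Needs(knows_beam=True)))
```

Real decisions will weigh more factors (cost, existing code, streaming needs), so treat this as a starting point for discussion rather than a verdict.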

More good reads:

Upvotes: 10
