Jonathan Santilli

Reputation: 171

Storage in Apache Flink

After processing millions of events, where is the best place to store the results, assuming the data is worth saving? I saw a pull request closed by this commit mentioning the Parquet format, but is HDFS the default? My concern is whether, after saving the data (and where?), it is easy (and fast!) to retrieve it.

Upvotes: 5

Views: 2470

Answers (1)

Fabian Hueske

Reputation: 18987

Apache Flink is not coupled with specific storage engines or formats. The best place to store the results computed by Flink depends on your use case.

  • Are you running a batch or streaming job?
  • What do you want to do with the result?
  • Do you need batch (full scan), point, or continuous streaming access to the data?
  • What format does the data have? Flat structured (relational), nested, blob, ...

Depending on the answers to these questions, you can choose from various storage backends, such as:

  • Apache HDFS for batch access (with different storage formats such as Parquet, ORC, or custom binary formats)
  • Apache Kafka if you want to access the data as a stream
  • a key-value store such as Apache HBase or Apache Cassandra for point access to the data
  • a database such as MongoDB, MySQL, ...

Flink provides OutputFormats for most of these systems (some through a wrapper for Hadoop OutputFormats). The "best" system depends on your use case.
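To illustrate, in a streaming job the storage choice mostly comes down to which sink you attach to the result stream. Here is a minimal sketch; the path, topic name, and properties are placeholders, and the exact Kafka connector class depends on your Flink and Kafka versions:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder stream; in practice this is the result of your pipeline.
        DataStream<String> results = env.fromElements("event-1", "event-2");

        // Option 1: write to a (distributed) file system such as HDFS
        // for later batch access.
        results.writeAsText("hdfs:///output/results");

        // Option 2: publish to Kafka for streaming access. The connector
        // class varies by version, e.g. FlinkKafkaProducer:
        // results.addSink(
        //     new FlinkKafkaProducer<>("results-topic",
        //                              new SimpleStringSchema(),
        //                              kafkaProps));

        env.execute("sink-sketch");
    }
}
```

For point access you would instead use a connector for a key-value store (e.g. the HBase or Cassandra connectors), but the structure of the job stays the same: compute the result, then attach the sink that matches the access pattern you need.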

Upvotes: 10
