Reputation: 87
I have a cluster set up as described in https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/cluster_setup.html. Each worker holds multiple CSV files, which together form that host's shard of the data. I want to use the Table API to calculate the sum of a CSV column across all hosts. Each worker should calculate the sum over its own CSV files and return the result to the master. Is this possible, and if so, what should I implement?
Upvotes: 0
Views: 160
Reputation: 18987
If I understand your question correctly, you'd like to read CSV files and sum up some fields. That's a rather simple query and not a problem for Flink.
With the latest Flink version (1.4.2), you can register a CsvTableSource as a table and run a query like SELECT sum(a), sum(b) FROM yourTable.
Note that the CSV files should be stored in a file system that is accessible from all machines (a distributed file system, NFS, ...).
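A minimal sketch of what this could look like with the Flink 1.4 batch Table API in Java. The path hdfs:///data/shards, the column names a and b, their types, and the sumQuery helper are assumptions for illustration only; adjust them to your actual files and schema.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.Types;
import org.apache.flink.table.api.java.BatchTableEnvironment;
import org.apache.flink.table.sources.CsvTableSource;
import org.apache.flink.types.Row;

public class CsvColumnSum {

    // Hypothetical helper: builds e.g. "SELECT SUM(a), SUM(b) FROM yourTable".
    static String sumQuery(String table, String... columns) {
        StringBuilder sb = new StringBuilder("SELECT ");
        for (int i = 0; i < columns.length; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append("SUM(").append(columns[i]).append(")");
        }
        return sb.append(" FROM ").append(table).toString();
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        BatchTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);

        // Assumed location and schema: a directory of CSV files on a
        // file system that every worker can reach.
        CsvTableSource source = CsvTableSource.builder()
            .path("hdfs:///data/shards")
            .field("a", Types.LONG())
            .field("b", Types.DOUBLE())
            .fieldDelimiter(",")
            .build();

        tableEnv.registerTableSource("yourTable", source);

        // Flink plans the distributed aggregation; no per-worker code is needed.
        Table result = tableEnv.sqlQuery(sumQuery("yourTable", "a", "b"));

        // Convert the single result row to a DataSet and print it on the client.
        DataSet<Row> rows = tableEnv.toDataSet(result, Row.class);
        rows.print();
    }
}
```

You do not implement the "sum per worker, then combine on the master" step yourself in this sketch: the query only states what to compute, and Flink decides how to distribute the work across the cluster.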
Upvotes: 1