kzeon

Reputation: 23

Google Dataflow pricing

I've recently started to investigate Dataflow for a new project (great stuff, and I'm impressed by it so far!), but I had a reality check this morning while checking the billing page in the dev console.

I started playing with Dataflow last week, launching every pipeline execution through Eclipse with the plugin. So far, I've launched the following 42 jobs:

Streaming ----- Nov 17, 2015, 3:20:37 PM ----- 12 min 20 sec
Streaming ----- Nov 17, 2015, 1:45:49 PM ----- 1 hr 36 min
Streaming ----- Nov 17, 2015, 1:25:25 PM ----- 21 min 0 sec
Streaming ----- Nov 17, 2015, 9:30:36 AM ----- 25 min 14 sec
Streaming ----- Nov 16, 2015, 4:44:09 PM ----- 29 min 27 sec
Streaming ----- Nov 16, 2015, 4:40:16 PM ----- 3 min 48 sec
Streaming ----- Nov 16, 2015, 4:37:32 PM ----- 3 min 33 sec
Streaming ----- Nov 16, 2015, 3:58:46 PM ----- 38 min 53 sec
Streaming ----- Nov 16, 2015, 3:46:18 PM ----- 12 min 59 sec
Streaming ----- Nov 16, 2015, 2:05:31 PM ----- 1 hr 41 min
Streaming ----- Nov 15, 2015, 4:28:06 PM ----- 21 hr 35 min
Streaming ----- Nov 13, 2015, 5:09:22 PM ----- 2 days 20 hr
Streaming ----- Nov 13, 2015, 4:30:34 PM ----- 2 days 21 hr
Streaming ----- Nov 13, 2015, 2:52:40 PM ----- 2 days 23 hr
Streaming ----- Nov 13, 2015, 2:42:27 PM ----- 10 min 20 sec
Streaming ----- Nov 13, 2015, 12:21:33 PM ----- 2 hr 19 min
Streaming ----- Nov 13, 2015, 12:12:24 PM ----- 9 min 24 sec
Streaming ----- Nov 13, 2015, 11:55:30 AM ----- 17 min 54 sec
Streaming ----- Nov 13, 2015, 11:51:49 AM ----- 4 min 28 sec
Streaming ----- Nov 13, 2015, 11:35:06 AM ----- 14 min 36 sec
Streaming ----- Nov 13, 2015, 11:32:51 AM ----- 3 min 2 sec
Streaming ----- Nov 13, 2015, 11:20:53 AM ----- 12 min 8 sec
Streaming ----- Nov 12, 2015, 2:11:08 PM ----- 20 hr 48 min
Streaming ----- Nov 12, 2015, 2:07:59 PM ----- 6 min 52 sec
Streaming ----- Nov 12, 2015, 1:24:33 PM ----- 50 min 15 sec
Streaming ----- Nov 12, 2015, 12:46:15 PM ----- 1 hr 28 min
Streaming ----- Nov 12, 2015, 12:43:59 PM ----- 1 hr 30 min
Streaming ----- Nov 12, 2015, 12:41:17 PM ----- 1 hr 33 min
Streaming ----- Nov 12, 2015, 12:36:44 PM ----- 5 min 32 sec
Streaming ----- Nov 12, 2015, 12:03:06 PM ----- 34 min 23 sec
Streaming ----- Nov 12, 2015, 11:55:00 AM ----- 8 min 55 sec
Streaming ----- Nov 12, 2015, 11:23:38 AM ----- 31 min 47 sec
Streaming ----- Nov 12, 2015, 11:07:25 AM ----- 16 min 30 sec
Streaming ----- Nov 12, 2015, 9:54:50 AM ----- 1 hr 11 min
Streaming ----- Nov 11, 2015, 5:10:36 PM ----- 16 hr 44 min
Streaming ----- Nov 11, 2015, 4:57:15 PM ----- 13 min 52 sec
Streaming ----- Nov 11, 2015, 4:48:52 PM ----- 3 min 59 sec
Streaming ----- Nov 11, 2015, 4:41:16 PM ----- 11 min 49 sec
Streaming ----- Nov 11, 2015, 4:32:01 PM ----- 21 min 6 sec
Batch ----- Nov 10, 2015, 3:36:09 PM ----- 1 min 37 sec
Batch ----- Nov 10, 2015, 2:41:28 PM ----- 1 min 48 sec
Batch ----- Nov 10, 2015, 2:37:17 PM ----- 1 min 39 sec

These were only tests with a tiny amount of data, nothing crazier than pulling a few elements from PubSub to understand how the SDK and the environment work. Yet here is what the billing page shows:

Google Compute ----- Dataflow Stream Processing VM running on Standard Intel N1 4 VCPU ----- 51,192 Minutes ----- $140.78
Google Compute ----- Standard Intel N1 4 VCPU running in NA ----- 51,192 Minutes ----- $170.64

(For the sake of simplicity, I'll ignore the 3 batch jobs that lasted less than 2 minutes each; they aren't really relevant to the following.)

From all of this, a few questions:

1) Am I missing something in terms of elapsed time? 51,192 minutes is 853.2 hours, far more than the sum of all my jobs' execution times. I do understand that a running instance is billed for at least 10 minutes, but even accounting for that, I'm still far away from 51,192 minutes. Given that duration, 853.2 hours x 11 GCEU x $0.015/GCEU/hour = $140.78, which matches the billing statement, but I would like to better understand how the total duration is computed. EDIT: 51,192 minutes is about 3 times the sum of the execution times of all my jobs. Is this factor of 3 related to the 3 workers I had configured? (See the sanity check sketched after these questions.)

2) Is it possible to configure the type of instances used by the pipeline? n1-standard-4 instances are really overkill for the kind of test I was performing. Is this configurable within the Eclipse plugin or the console? EDIT: Found the answer to this one.

3) I never really noticed until now that 3 workers were starting every time I started a job. I never actually configured anything related to that. I guess this is the default number of workers when creating a run configuration within Eclipse? EDIT: Found the answer as well.
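To make question 1 concrete, here is a minimal sanity check of the arithmetic, using only the numbers quoted above (11 GCEU per n1-standard-4 worker, $0.015 per GCEU per hour, 3 workers); it is plain arithmetic, not a call into any billing API:

    // A minimal sanity check of the billing numbers quoted in this question.
    public class BillingSanityCheck {
        public static void main(String[] args) {
            double billedMinutes = 51_192;            // from the billing page
            double billedHours = billedMinutes / 60;  // 853.2 hours
            int workers = 3;                          // default worker count
            double perWorkerHours = billedHours / workers; // ~284.4 hours

            double gceuPerVm = 11;                    // n1-standard-4 (4 vCPU)
            double ratePerGceuHour = 0.015;
            double cost = billedHours * gceuPerVm * ratePerGceuHour;

            // ~284.4 hours per worker, roughly the summed duration of the
            // 39 streaming jobs listed above -- hence the factor of 3.
            System.out.printf("Hours per worker: %.1f%n", perWorkerHours);
            System.out.printf("Cost: $%.2f%n", cost); // ~$140.78, as billed
        }
    }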

Upvotes: 1

Views: 1183

Answers (1)

Ben Chambers

Reputation: 6130

Thanks for trying Dataflow -- we're glad you like it!

  1. Elapsed time measures the GCE VM usage. As you mentioned in the edit, 3 workers = 3 VMs, so there is a factor of 3 associated with the actual VM time.
  2. You can set the --workerMachineType option, as documented in Setting Other Cloud Pipeline Options (see the sketch after this list).
  3. 3 is the default number of workers associated with a pipeline. It can be specified explicitly with --numWorkers, although that will prevent autoscaling from adjusting the number of workers as appropriate. As documented there, you can use --maxNumWorkers instead to limit the upper bound while allowing autoscaling to adjust the actual number of workers.
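For illustration, here is a minimal sketch of setting those options in code, assuming the 1.x Java SDK's DataflowPipelineOptions API; the machine type and worker cap are arbitrary example values:

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

    public class WorkerOptionsExample {
        public static void main(String[] args) {
            // Picks up --workerMachineType, --numWorkers, --maxNumWorkers,
            // etc. from the command line.
            DataflowPipelineOptions options = PipelineOptionsFactory
                .fromArgs(args).withValidation()
                .as(DataflowPipelineOptions.class);

            // The same options can also be set programmatically:
            options.setWorkerMachineType("n1-standard-1"); // smaller test VMs
            options.setMaxNumWorkers(2); // cap the pool, autoscaling adjusts

            Pipeline p = Pipeline.create(options);
            // ... apply transforms and p.run() as usual ...
        }
    }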

You may want to use the local runner to execute the pipeline on your machine during development. It sounds like the amount of data you are testing with is small enough that you don't need the scale of running on the service. You can use PubSubIO to create a bounded source which will work with the local runner by calling maxNumRecords or maxReadTime.
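For instance, a rough sketch of such a bounded local run, assuming the 1.x Java SDK class names of the time (the topic path is a placeholder):

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.PubsubIO;
    import com.google.cloud.dataflow.sdk.options.DirectPipelineOptions;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner;

    public class LocalPubsubTest {
        public static void main(String[] args) {
            DirectPipelineOptions options = PipelineOptionsFactory
                .create().as(DirectPipelineOptions.class);
            options.setRunner(DirectPipelineRunner.class); // local, no VMs

            Pipeline p = Pipeline.create(options);
            p.apply(PubsubIO.Read
                .topic("projects/my-project/topics/my-topic") // placeholder
                .maxNumRecords(100)); // bounds the source so the job ends
            p.run();
        }
    }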

Upvotes: 7
