molecule

Reputation: 1121

How can I make a non-atomic batch (or equivalent) statement in Cassandra?

I am using the DataStax Node.js driver for Cassandra, and I want to avoid the very frequent I/O operations that inserts will cause in my application. I will be doing around 1000 inserts per second and want to group them together and perform 1 I/O instead of running individual queries, which would cause 1000 I/Os. I came across batch statements like the one below:

const query1 = 'UPDATE user_profiles SET email = ? WHERE key = ?';
const query2 = 'INSERT INTO user_track (key, text, date) VALUES (?, ?, ?)';
const queries = [
   { query: query1, params: [emailAddress, 'hendrix'] },
   { query: query2, params: ['hendrix', 'Changed email', new Date()] } 
];
client.batch(queries, { prepare: true }, function (err) {
   // All queries have been executed successfully
   // Or none of the changes have been applied, check err
});

The problem here is that they are atomic. I want the other statements to succeed even if one of them fails. Is there something I can do to achieve that?

Upvotes: 1

Views: 671

Answers (1)

Christophe Schmitz

Reputation: 2996

A batch statement across multiple partitions (which is the case with your write statements) uses a LOGGED batch by default. This means that you get the atomicity property. If you really want to remove the atomicity part, you should use an UNLOGGED batch. You should be aware, however, that an UNLOGGED batch across multiple partitions is an anti-pattern: https://issues.apache.org/jira/browse/CASSANDRA-9282. Let me try to explain:

When using a batch statement, two questions give you 4 possible cases:

  • Is your batch against a single partition, or against multiple partitions (which is your case)?
  • Is your batch LOGGED or UNLOGGED? LOGGED ensures atomicity (either all operations succeed or none do). LOGGED batches are more costly.

Let's consider the 4 options:

  1. Single partition, LOGGED batch. You use this when you want atomicity for your writes against a single partition. This atomicity has a cost, so use it only if you need it.
  2. Single partition, UNLOGGED batch. You use this when you don't need atomicity; it is faster. If your application is correctly configured (token-aware), your batch statement will choose a replica for this partition as coordinator, and you will get a performance boost. That's the only legitimate reason to use an UNLOGGED batch. By default, a batch against a single partition is UNLOGGED.
  3. Multiple partitions, LOGGED batch. The only reason to batch queries hitting different partitions is to ensure atomicity. By default, a batch against multiple partitions is LOGGED.
  4. Multiple partitions, UNLOGGED batch. This is an anti-pattern because it brings no functional value (no atomicity) and no performance benefit (multiple partitions are involved, so the coordinator has the overhead of contacting the replicas responsible for each partition, which is extra work).
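For reference, here is a minimal sketch of how the LOGGED/UNLOGGED choice can be expressed with the Node.js driver. It relies on the `logged` query option of `client.batch()`, which defaults to `true` in the driver. The helper name `executeUnloggedBatch` is made up for illustration, and `client` is assumed to be an already connected driver `Client`.

```javascript
// Sketch: send a batch without the batchlog (UNLOGGED).
// `client` is assumed to be a connected cassandra-driver Client;
// `queries` has the same shape as in the question: [{ query, params }, ...].
function executeUnloggedBatch(client, queries) {
  // `logged: false` requests an UNLOGGED batch.
  // With no callback argument, client.batch() returns a Promise.
  return client.batch(queries, { prepare: true, logged: false });
}
```

Note that even an UNLOGGED batch reports a single success or error for the whole request. If it fails, the client cannot tell which statements were applied, so this does not give you a per-statement result.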

To make it more concrete: when you issue what you call a 'single IO' batch statement across multiple partitions, the coordinator has to slice your 'single IO' into 1000 IOs anyway (this wouldn't be the case if all the writes were on the same partition) and coordinate them across multiple replicas.
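If some of your writes do share a partition, one way to stay in the "single partition" case is to group statements by partition key on the client and send one small batch per partition. The sketch below is only an illustration: `toPartitionBatches` is a hypothetical helper, and it assumes that `key` is the partition key of `user_track`, which the question does not confirm.

```javascript
// Group rows by partition key so that each batch touches one partition only.
// `partitionKeyOf` and `toStatement` are caller-supplied callbacks (hypothetical
// names); the key is stringified, so composite keys need a custom encoding.
function toPartitionBatches(rows, partitionKeyOf, toStatement) {
  const groups = new Map();
  for (const row of rows) {
    const key = String(partitionKeyOf(row));
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(toStatement(row));
  }
  return [...groups.values()];
}

// Example data; assumes `key` is the partition key of user_track.
const insert = 'INSERT INTO user_track (key, text, date) VALUES (?, ?, ?)';
const rows = [
  { key: 'hendrix', text: 'Changed email', date: new Date() },
  { key: 'joplin', text: 'Logged in', date: new Date() },
  { key: 'hendrix', text: 'Logged out', date: new Date() },
];
const batches = toPartitionBatches(
  rows,
  (row) => row.key,
  (row) => ({ query: insert, params: [row.key, row.text, row.date] })
);
// batches holds two arrays: two statements for 'hendrix', one for 'joplin'.
// Each one could then be sent with:
//   client.batch(batch, { prepare: true, logged: false })
```

With 1000 inserts that all target different partitions, this grouping degenerates into 1000 one-statement batches, which is the same as sending individual queries.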

To conclude, you might observe a performance improvement on the client side, but you will induce a much larger cost on the Cassandra side.

You might want to read the following blog post: http://batey.info/cassandra-anti-pattern-misuse-of.html, in particular the section commenting on the use of UNLOGGED batches against multiple partitions:

What this is actually doing is putting a huge amount of pressure on a single coordinator. This is because the coordinator needs to forward each individual insert to the correct replicas. You're losing all the benefit of token aware load balancing policy as you're inserting different partitions in a single round trip to the database.
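A common alternative, and the one that keeps token-aware routing, is to send the statements individually but asynchronously, with a cap on how many are in flight. Each write then succeeds or fails on its own, which is what the question asks for. The following is a rough sketch under assumptions: `executeIndividually` and the default `limit` of 32 are made up for illustration, and `client` is assumed to behave like the driver's `Client`, whose `execute()` returns a Promise when no callback is passed.

```javascript
// Run one statement per parameter set with at most `limit` requests in flight.
// Failures are recorded per statement and do not stop the remaining writes.
async function executeIndividually(client, query, paramSets, limit = 32) {
  const errors = [];
  let next = 0;

  async function worker() {
    // Each worker claims the next unprocessed index until none are left.
    while (next < paramSets.length) {
      const index = next++;
      try {
        await client.execute(query, paramSets[index], { prepare: true });
      } catch (err) {
        errors.push({ index, err });
      }
    }
  }

  const workerCount = Math.min(limit, paramSets.length);
  await Promise.all(Array.from({ length: workerCount }, worker));
  return errors; // empty array when every write succeeded
}
```

The returned array tells you exactly which parameter sets failed, so they can be retried or logged. Recent versions of the driver also ship an `executeConcurrent` helper that covers similar ground; check the driver documentation for its options before rolling your own.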

Upvotes: 4
