user2325219

Reputation: 1

Cassandra Hadoop - is it possible to read and write to the same column family

Using Cassandra 1.1, is it possible to have a Hadoop job that reads from Column Family X and "updates" it at the same time? That is, specify X as the Input Column Family and then:

  1. in the map step, update the same CF (e.g. via Hector), or
  2. if #1 is not possible, update the same CF in the reduce step (directly via Hector, or alternatively by specifying the CF as the output column family).

What we are trying to do is this: we have (potentially very wide) rows that we will be reading in. In the map() method, we iterate through the columns of each row, and as each column is processed we no longer need it, so we plan to "expire" it by rewriting it in Cassandra with TTL = 1 sec.
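For illustration, a minimal sketch of that map-side plan against Cassandra 1.1's ColumnFamilyInputFormat, using Hector to re-write each processed column with a 1-second TTL. The cluster, keyspace, column family name "X" and the host are placeholders, and error handling is omitted:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.SortedMap;

    import me.prettyprint.cassandra.serializers.ByteBufferSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.HColumn;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;
    import org.apache.cassandra.db.IColumn;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch only: processes each column of the incoming row, then re-inserts it
    // with TTL = 1 so Cassandra expires it shortly afterwards.
    public class ExpireColumnsMapper
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable> {

        private Keyspace keyspace;

        @Override
        protected void setup(Context context) {
            // Placeholder cluster/keyspace names -- replace with your own.
            Cluster cluster = HFactory.getOrCreateCluster("MyCluster", "cassandra-host:9160");
            keyspace = HFactory.createKeyspace("MyKeyspace", cluster);
        }

        @Override
        protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
                throws IOException, InterruptedException {
            Mutator<ByteBuffer> mutator = HFactory.createMutator(keyspace, ByteBufferSerializer.get());
            for (IColumn column : columns.values()) {
                // ... process the column, emit whatever the job needs ...

                // Re-insert the same column with a 1-second TTL so it expires.
                HColumn<ByteBuffer, ByteBuffer> expiring =
                        HFactory.createColumn(column.name(), column.value(),
                                ByteBufferSerializer.get(), ByteBufferSerializer.get());
                expiring.setTtl(1);
                mutator.addInsertion(key, "X", expiring);
            }
            mutator.execute();
        }
    }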

If it's not possible or advisable to do that in the map step, then we are prepared to do it in the reduce step. However, we would prefer the map step, since doing it in the reduce step means we would need to give the reduce() method enough information to identify the row+column we are trying to expire, which in turn means our map step would have to include that information in its output - something we are trying to avoid if possible.

So again, is it possible to do this using either #1 or #2?

Upvotes: 0

Views: 309

Answers (1)

odiszapc

Reputation: 4109

First, you can do anything in your map or reduce steps. So, yes, it is possible.

It's possible to write to the same column family in the reduce step, because the map and reduce steps are executed sequentially. Feel free to update any column family in the reduce step.
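A minimal sketch of such a reducer, assuming the job uses Cassandra's ColumnFamilyOutputFormat and that the map output carries the row key plus the column names to expire (the key/value types and class names here are assumptions, not your actual job):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.cassandra.thrift.Column;
    import org.apache.cassandra.thrift.ColumnOrSuperColumn;
    import org.apache.cassandra.thrift.Mutation;
    import org.apache.cassandra.utils.ByteBufferUtil;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch only: turns (row key, column name) pairs emitted by the mapper into
    // TTL = 1 re-writes that ColumnFamilyOutputFormat applies to the column family.
    public class ExpireColumnsReducer
            extends Reducer<Text, BytesWritable, ByteBuffer, List<Mutation>> {

        @Override
        protected void reduce(Text rowKey, Iterable<BytesWritable> columnNames, Context context)
                throws IOException, InterruptedException {
            List<Mutation> mutations = new ArrayList<Mutation>();
            for (BytesWritable name : columnNames) {
                Column column = new Column();
                column.setName(ByteBuffer.wrap(name.getBytes(), 0, name.getLength()));
                // The value is irrelevant here: the column expires one second later anyway.
                column.setValue(ByteBufferUtil.EMPTY_BYTE_BUFFER);
                column.setTimestamp(System.currentTimeMillis() * 1000);
                column.setTtl(1);

                Mutation mutation = new Mutation();
                mutation.setColumn_or_supercolumn(new ColumnOrSuperColumn().setColumn(column));
                mutations.add(mutation);
            }
            // ColumnFamilyOutputFormat expects a ByteBuffer row key and a list of mutations.
            context.write(ByteBufferUtil.bytes(rowKey.toString()), mutations);
        }
    }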

About the map step: it is possible to write to the same column family in the map step with the Hector/Thrift API, but this is bad practice. First, the map step is designed for reading data only: it iterates over rows using the fast, low-level Cassandra reader implementation in Hadoop, so it runs quickly. With Hector in the loop, your map step will be much slower.

If the data you want to delete in the map step will never be used in later steps, you can do it, but I repeat: writing to the dataset you are iterating over in the map step is bad practice. If your map-reduce job fails (for any reason), the data you expired in the map step may be lost in an inconsistent way: the columns were already deleted in the map step, but the reducer will never see them because the job failed.

Map-reduce rule: the data you iterate over should be modified in a successive manner. First iterate over the dataset, then modify it; don't do both simultaneously.

Answering your question: it is possible in both cases, but #2 is the valid approach. You should use the reduce step for write/delete operations.
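If you go the output-column-family route, the driver simply points both the input and the output side of the job at the same column family via ConfigHelper. A rough, hypothetical setup (keyspace, column family name, host and partitioner below are placeholders):

    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;
    import org.apache.cassandra.utils.ByteBufferUtil;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Sketch only: reads from and writes back to the same column family "X".
    public class ExpireJobDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "expire-processed-columns");
            job.setJarByClass(ExpireJobDriver.class);
            // job.setMapperClass(...) / job.setReducerClass(...) as appropriate for your job.

            Configuration conf = job.getConfiguration();

            // Input side: keyspace "MyKeyspace", column family "X" (placeholders).
            ConfigHelper.setInputInitialAddress(conf, "cassandra-host");
            ConfigHelper.setInputRpcPort(conf, "9160");
            ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
            ConfigHelper.setInputColumnFamily(conf, "MyKeyspace", "X");
            ConfigHelper.setInputSlicePredicate(conf,
                    new SlicePredicate().setSlice_range(new SliceRange(
                            ByteBufferUtil.EMPTY_BYTE_BUFFER, ByteBufferUtil.EMPTY_BYTE_BUFFER,
                            false, Integer.MAX_VALUE)));

            // Output side: the very same column family.
            ConfigHelper.setOutputInitialAddress(conf, "cassandra-host");
            ConfigHelper.setOutputRpcPort(conf, "9160");
            ConfigHelper.setOutputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
            ConfigHelper.setOutputColumnFamily(conf, "MyKeyspace", "X");

            job.setInputFormatClass(ColumnFamilyInputFormat.class);
            job.setOutputFormatClass(ColumnFamilyOutputFormat.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }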

P.S. It seems you are trying to use Hadoop as a garbage collector - that is not what it was designed for.

Upvotes: 1
