CODEWITHSUNDEEP

hadoop cascading

Reputation: 11

Hadoop Cascading framework to Update specific column data

I have a MongoDB collection that looks like this:

Id   Name  createTime  updateTime  Age  Country  verificationStatus
Id1  Abc   10-7-2013   10-7-2013   21   Xxxx     INITIAL_MAIL
Id2  Efg   9-7-2013    10-7-2013   22   Xxxx     FIRST_REMINDER
Id3  Hij   8-7-2013    10-7-2013   45   Xxxx     INITIAL_MAIL

I have a Cascading job that runs some evaluation against another collection, and I want to update only the “verificationStatus” and “updateTime” columns by “Id”, without disturbing the other columns.

But in Cascading, if I set only these two columns, I lose the data in the other columns and am left with something like this:

Id   updateTime  verificationStatus
Id1  11-7-2013   BLOCKED
Id2  11-7-2013   SECOND_REMINDER
Id3  11-7-2013   FIRST_REMINDER

SinkMode.UPDATE works well for updating transaction by transaction, but not for updating individual column data.

How can I approach this issue?

PS: Join or Merge doesn’t work, since by Cascading's design a Source and a Sink cannot point to the same collection.

Upvotes: 0

Views: 104

Answers (1)

Engineiro

Reputation: 1146

Option 1:

Write a Cascading Function that computes the new values for the two columns above. Pass the Function and the original fields into a Pipe, and use Fields.REPLACE so that the new values replace the old ones in those columns while the remaining columns are left as they are.
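To make the intent of Option 1 concrete, here is a minimal plain-Java model of the replace semantics: a function sees only the selected columns, and its results are written back under the same names while every other column passes through. This is only a sketch and not Cascading code; `ReplaceSketch`, `replaceColumns` and `demo` are hypothetical names, and the new values are hard-coded stand-ins for whatever the evaluation job would compute.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Plain-Java model only; none of these names are Cascading API.
final class ReplaceSketch {

    // Runs fn on the argument columns only, then writes its results back under the
    // same column names. Columns outside argumentColumns pass through untouched.
    static Map<String, String> replaceColumns(Map<String, String> row,
                                              List<String> argumentColumns,
                                              UnaryOperator<Map<String, String>> fn) {
        Map<String, String> arguments = new LinkedHashMap<>();
        for (String column : argumentColumns) {
            arguments.put(column, row.get(column));
        }
        Map<String, String> results = fn.apply(arguments);

        Map<String, String> updated = new LinkedHashMap<>(row); // copy keeps column order
        for (String column : argumentColumns) {
            updated.put(column, results.get(column));
        }
        return updated;
    }

    // Row Id1 from the question, updated with hypothetical hard-coded values.
    static Map<String, String> demo() {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("Id", "Id1");
        row.put("Name", "Abc");
        row.put("createTime", "10-7-2013");
        row.put("updateTime", "10-7-2013");
        row.put("Age", "21");
        row.put("Country", "Xxxx");
        row.put("verificationStatus", "INITIAL_MAIL");

        return replaceColumns(row, Arrays.asList("updateTime", "verificationStatus"), old -> {
            Map<String, String> replaced = new LinkedHashMap<>();
            replaced.put("updateTime", "11-7-2013");
            replaced.put("verificationStatus", "BLOCKED");
            return replaced;
        });
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The returned row still has all seven columns; only updateTime and verificationStatus differ from the input. In a real job the same shape would presumably be expressed as a Function applied with Fields.REPLACE, as described above.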

Option 2:

You could create two Pipes: one carrying the original columns that you want to keep, including the Id field you mention in your post, and another that computes the updated columns. Then use a CoGroup on Id to bring the two datasets back together.
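The join in Option 2 can be modelled in plain Java as follows. Again, this is a sketch of the idea rather than Cascading code: `JoinByIdSketch`, `row`, `joinById` and `demo` are hypothetical names, and an inner join is assumed, so a kept row without a matching update is dropped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model only; none of these names are Cascading API.
final class JoinByIdSketch {

    // Builds an ordered row from alternating column/value pairs.
    static Map<String, String> row(String... columnsAndValues) {
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i + 1 < columnsAndValues.length; i += 2) {
            result.put(columnsAndValues[i], columnsAndValues[i + 1]);
        }
        return result;
    }

    // Inner join on "Id": each kept row is extended with the columns of its update row.
    // Kept rows with no matching update are dropped.
    static List<Map<String, String>> joinById(List<Map<String, String>> kept,
                                              List<Map<String, String>> updates) {
        Map<String, Map<String, String>> updatesById = new LinkedHashMap<>();
        for (Map<String, String> update : updates) {
            updatesById.put(update.get("Id"), update);
        }
        List<Map<String, String>> joined = new ArrayList<>();
        for (Map<String, String> keptRow : kept) {
            Map<String, String> update = updatesById.get(keptRow.get("Id"));
            if (update == null) {
                continue;
            }
            Map<String, String> merged = new LinkedHashMap<>(keptRow);
            merged.putAll(update); // "Id" is already present; the new columns are appended
            joined.add(merged);
        }
        return joined;
    }

    // Two rows from the question, split into "kept" and "updated" column sets.
    static List<Map<String, String>> demo() {
        List<Map<String, String>> kept = new ArrayList<>();
        kept.add(row("Id", "Id1", "Name", "Abc", "createTime", "10-7-2013", "Age", "21", "Country", "Xxxx"));
        kept.add(row("Id", "Id2", "Name", "Efg", "createTime", "9-7-2013", "Age", "22", "Country", "Xxxx"));

        List<Map<String, String>> updates = new ArrayList<>();
        updates.add(row("Id", "Id1", "updateTime", "11-7-2013", "verificationStatus", "BLOCKED"));
        updates.add(row("Id", "Id2", "updateTime", "11-7-2013", "verificationStatus", "SECOND_REMINDER"));

        return joinById(kept, updates);
    }

    public static void main(String[] args) {
        for (Map<String, String> joinedRow : demo()) {
            System.out.println(joinedRow);
        }
    }
}
```

Each joined row carries the untouched columns from the first dataset plus the two recomputed columns from the second. If rows without an update must survive, an outer-style join would be needed instead of the inner join assumed here.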

Upvotes: 1
