Reputation: 43
I'm facing a data transformation issue : I have the table here under with 3 columns : client, event, timestamp. And I basically want to assign a sequence number to all events for a given client based on timestamp, which is basically the "Sequence" columns I added hereunder.
Client Event TimeStamp Sequence
C1 Ph 2014-01-30 12:15:23 1
C1 Me 2014-01-31 15:11:34 2
C1 Me 2014-01-31 17:16:05 3
C2 Me 2014-02-01 09:22:52 1
C2 Ph 2014-02-01 17:22:52 2
I can't figure out how to create this sequence number in hive or Pig. Would you have any clue ?
Thanks in advance !
Guillaume
Upvotes: 2
Views: 2029
Reputation: 979
Put all the records in a bag (by say grouping all), sort the tuples inside bag by TimeStamp field and then use Enumerate function.
Something like below (I did not execute the code, so you might need to clean it up a bit):
// assuming input contains 3 columns - client, event, timestamp
input2 = GROUP input all;
input3 = FOREACH input2
{
sorted = ORDER input BY timestamp;
sorted2 = Enumerate(sorted);
GENERATE FLATTEN(sorted2);
}
Upvotes: 4
Reputation: 1
We eventually modified enumerate source the following way and it works great :
public void accumulate(Tuple arg0) throws IOException {
nevents=13;
i=nevents+1;
DataBag inputBag = (DataBag)arg0.get(0);
Tuple t2 = TupleFactory.getInstance().newTuple();
for (Tuple t : inputBag) {
Tuple t1 = TupleFactory.getInstance().newTuple(t.getAll());
tampon=t1.get(2).toString();
if (tampon.equals("NA souscription Credit Conso")) {
if (i <= nevents) {
outputBag.add(t2);
t2=TupleFactory.getInstance().newTuple();
}
i=0;
t2.append(t1.get(0).toString());
t2.append(t1.get(1).toString());
t2.append(t1.get(2).toString());
i++;
}
else if (i < nevents) {
t2.append(tampon);
i++;
}
else if (i == nevents) {
t2.append(tampon);
outputBag.add(t2);
i++;
t2=TupleFactory.getInstance().newTuple();
}
if (count % 1000000 == 0) {
outputBag.spill();
count = 0;
}
;
count++;
}
if (t2.size()!=0) {
outputBag.add(t2);
}
}
Upvotes: 0