statquant
statquant

Reputation: 14370

Find first/last row "recursively" by group

I try to find a efficient way of finding the first and last line by group.

R) ex=data.table(state=c("az","fl","fl","fl","fl","fl","oh"),city=c("TU","MI","MI","MI","MI","MI","MI"),code=c(85730,33133,33133,33133,33146,33146,45056))
R) ex
   state city  code
1:    az   TU 85730           
2:    fl   MI 33133           
3:    fl   MI 33133           
4:    fl   MI 33133           
5:    fl   MI 33146           
6:    fl   MI 33146           
7:    oh   MI 45056           

I would like to find the first and last for each variable of a group

R) ex
   state city  code first.state last.state first.city last.city first.code last.code
1:    az   TU 85730           1          1          1         1          1         1
2:    fl   MI 33133           1          0          1         0          1         0
3:    fl   MI 33133           0          0          0         0          0         0
4:    fl   MI 33133           0          0          0         0          0         1
5:    fl   MI 33146           0          0          0         0          1         0
6:    fl   MI 33146           0          1          0         1          0         1
7:    oh   MI 45056           1          1          1         1          1         1

As far as I know data.table cannot easily help for things like this because by="state,city,code" would look at 4 triplets.

The only way I know would be to look for first/last.code in a by="state,city,code" then first/last.city in a by="state,city".


This is what I meant:

applyAll <- function(DT, by){
    f<- function(n, vec){ return(vec[1:n]) }
    by <- lapply(1:length(by), FUN=f, by)
    out <- Reduce(f=firstLast, init=DT, x=by)
    return(out)
}
firstLast <- function(DT, by){
    addNames <- paste(c("first", "last"),by[length(by)], sep=".")
    DT[DT[,list(IDX=.I[1]), by=by]$IDX, addNames[1]:=1]
    DT[DT[,list(IDX=.I[.N]), by=by]$IDX, addNames[2]:=1]
    return(DT);
}

Result by: applyAll(ex,c("state","city","code")) but this would make NUMEROUS copies of DT, my question is then, is there someting scheduled or already existing such that we cant get first/last by groups. (This is fairly vanilla for SAS or kdb or SQL)

In SAS:

data DT;
    set ex;
    by state city code;
    if first.code then firstcode=1;
    if last.code then lastcode=1;
    if first.city then firstcity=1;
    if last.city then lastcity=1;
    if first.state then firststate=1;
    if last.state then laststate=1;
run;

Upvotes: 1

Views: 588

Answers (2)

Matt Dowle
Matt Dowle

Reputation: 59612

If this is the question :

For a set of columns (x,y,z) I'd like to add an integer column marking the position of the first item of each group by="x", by="x,y" and by="x,y,z" (three new columns). The 1st row of each new column will always be 1 because that's always the first item of the first group. I'd also like to add a further 3 columns marking the last item by each of the same 3 groupings. I might have many more than just 3 groupings, though, so is something programatic possible please?

then how about :

ex=data.table(state=c("az","fl","fl","fl","fl","fl","oh"),
              city=c("TU","MI","MI","MI","MI","MI","MI"),
              code=c(85730,33133,33133,33133,33146,33146,45056))
ex
   state city  code
1:    az   TU 85730
2:    fl   MI 33133
3:    fl   MI 33133
4:    fl   MI 33133
5:    fl   MI 33146
6:    fl   MI 33146
7:    oh   MI 45056

cols = c("state","city","code")
for (i in seq_along(cols)) {
  ex[,paste0("f.",cols[i]):=c(1L,rep(0L,.N-1L)),by=eval(head(cols,i))] # first
  ex[,paste0("l.",cols[i]):=c(rep(0L,.N-1L),1L),by=eval(head(cols,i))] # last
}
ex
   state city  code f.state l.state f.city l.city f.code l.code
1:    az   TU 85730       1       1      1      1      1      1
2:    fl   MI 33133       1       0      1      0      1      0
3:    fl   MI 33133       0       0      0      0      0      0
4:    fl   MI 33133       0       0      0      0      0      1
5:    fl   MI 33146       0       0      0      0      1      0
6:    fl   MI 33146       0       1      0      1      0      1
7:    oh   MI 45056       1       1      1      1      1      1

But as @Roland commented, there's probably a better way to achieve your ultimate goal.

And, as requested, here's what should be a faster solution using .I and .N :

cols = c("state","city","code")
for (i in seq_along(cols)) {
  w = ex[,list(f=.I[1],l=.I[.N]),by=eval(head(cols,i))]
  ex[,paste0(c("f.","l."),cols[i]):=0L]  # add the two 0 columns
  ex[w$f,paste0("f.",cols[i]):=1L]       # mark the firsts
  ex[w$l,paste0("l.",cols[i]):=1L]       # mark the lasts
}

It should be faster because the grouping is done just once per column, and lots of small vectors are not created (no call to c() or rep() for each group) unlike the first solution.

Upvotes: 5

Andrie
Andrie

Reputation: 179448

It's not entirely clear what you want, but you can certainly have more than one column in the index:

ex[, list(first=head(code, 1), last=tail(code, 1)), by=c("state", "city")]
   state city first  last
1:    az   TU 85730 85730
2:    fl   MI 33133 33146
3:    oh   MI 45056 45056

You can automate this over your groups like this:

by <- c("state", "city", "code")
byList <- lapply(seq_along(by), function(i)by[sequence(i)])
lapply(byList, 
       function(i) ex[, list(first=head(code, 1), last=tail(code, 1)), by=i] )

[[1]]
   state first  last
1:    az 85730 85730
2:    fl 33133 33146
3:    oh 45056 45056

[[2]]
   state city first  last
1:    az   TU 85730 85730
2:    fl   MI 33133 33146
3:    oh   MI 45056 45056

[[3]]
   state city  code first  last
1:    az   TU 85730 85730 85730
2:    fl   MI 33133 33133 33133
3:    fl   MI 33146 33146 33146
4:    oh   MI 45056 45056 45056

Upvotes: 2

Related Questions