Find first/last row "recursively" by group

Question

I try to find a efficient way of finding the first and last line by group.

R) ex=data.table(state=c("az","fl","fl","fl","fl","fl","oh"),city=c("TU","MI","MI","MI","MI","MI","MI"),code=c(85730,33133,33133,33133,33146,33146,45056))
R) ex
   state city  code
1:    az   TU 85730           
2:    fl   MI 33133           
3:    fl   MI 33133           
4:    fl   MI 33133           
5:    fl   MI 33146           
6:    fl   MI 33146           
7:    oh   MI 45056

I would like to find the first and last for each variable of a group

R) ex
   state city  code first.state last.state first.city last.city first.code last.code
1:    az   TU 85730           1          1          1         1          1         1
2:    fl   MI 33133           1          0          1         0          1         0
3:    fl   MI 33133           0          0          0         0          0         0
4:    fl   MI 33133           0          0          0         0          0         1
5:    fl   MI 33146           0          0          0         0          1         0
6:    fl   MI 33146           0          1          0         1          0         1
7:    oh   MI 45056           1          1          1         1          1         1

As far as I know data.table cannot easily help for things like this because by="state,city,code" would look at 4 triplets.

The only way I know would be to look for first/last.code in a by="state,city,code" then first/last.city in a by="state,city".

This is what I meant:

applyAll <- function(DT, by){
    f<- function(n, vec){ return(vec[1:n]) }
    by <- lapply(1:length(by), FUN=f, by)
    out <- Reduce(f=firstLast, init=DT, x=by)
    return(out)
}
firstLast <- function(DT, by){
    addNames <- paste(c("first", "last"),by[length(by)], sep=".")
    DT[DT[,list(IDX=.I[1]), by=by]$IDX, addNames[1]:=1]
    DT[DT[,list(IDX=.I[.N]), by=by]$IDX, addNames[2]:=1]
    return(DT);
}

Result by: applyAll(ex,c("state","city","code")) but this would make NUMEROUS copies of DT, my question is then, is there someting scheduled or already existing such that we cant get first/last by groups. (This is fairly vanilla for SAS or kdb or SQL)

In SAS:

data DT;
    set ex;
    by state city code;
    if first.code then firstcode=1;
    if last.code then lastcode=1;
    if first.city then firstcity=1;
    if last.city then lastcity=1;
    if first.state then firststate=1;
    if last.state then laststate=1;
run;

Matt Dowle · Accepted Answer

If this is the question :

For a set of columns (x,y,z) I'd like to add an integer column marking the position of the first item of each group by="x", by="x,y" and by="x,y,z" (three new columns). The 1st row of each new column will always be 1 because that's always the first item of the first group. I'd also like to add a further 3 columns marking the last item by each of the same 3 groupings. I might have many more than just 3 groupings, though, so is something programatic possible please?

then how about :

ex=data.table(state=c("az","fl","fl","fl","fl","fl","oh"),
              city=c("TU","MI","MI","MI","MI","MI","MI"),
              code=c(85730,33133,33133,33133,33146,33146,45056))
ex
   state city  code
1:    az   TU 85730
2:    fl   MI 33133
3:    fl   MI 33133
4:    fl   MI 33133
5:    fl   MI 33146
6:    fl   MI 33146
7:    oh   MI 45056

cols = c("state","city","code")
for (i in seq_along(cols)) {
  ex[,paste0("f.",cols[i]):=c(1L,rep(0L,.N-1L)),by=eval(head(cols,i))] # first
  ex[,paste0("l.",cols[i]):=c(rep(0L,.N-1L),1L),by=eval(head(cols,i))] # last
}
ex
   state city  code f.state l.state f.city l.city f.code l.code
1:    az   TU 85730       1       1      1      1      1      1
2:    fl   MI 33133       1       0      1      0      1      0
3:    fl   MI 33133       0       0      0      0      0      0
4:    fl   MI 33133       0       0      0      0      0      1
5:    fl   MI 33146       0       0      0      0      1      0
6:    fl   MI 33146       0       1      0      1      0      1
7:    oh   MI 45056       1       1      1      1      1      1

But as @Roland commented, there's probably a better way to achieve your ultimate goal.

And, as requested, here's what should be a faster solution using .I and .N :

cols = c("state","city","code")
for (i in seq_along(cols)) {
  w = ex[,list(f=.I[1],l=.I[.N]),by=eval(head(cols,i))]
  ex[,paste0(c("f.","l."),cols[i]):=0L]  # add the two 0 columns
  ex[w$f,paste0("f.",cols[i]):=1L]       # mark the firsts
  ex[w$l,paste0("l.",cols[i]):=1L]       # mark the lasts
}

It should be faster because the grouping is done just once per column, and lots of small vectors are not created (no call to c() or rep() for each group) unlike the first solution.

Find first/last row "recursively" by group

Answers (2)

Related Questions

Find first/last row &quot;recursively&quot; by group

Answers (2)

Related Questions

Find first/last row "recursively" by group