Reputation: 14370
I try to find a efficient way of finding the first and last line by group.
R) ex=data.table(state=c("az","fl","fl","fl","fl","fl","oh"),city=c("TU","MI","MI","MI","MI","MI","MI"),code=c(85730,33133,33133,33133,33146,33146,45056))
R) ex
state city code
1: az TU 85730
2: fl MI 33133
3: fl MI 33133
4: fl MI 33133
5: fl MI 33146
6: fl MI 33146
7: oh MI 45056
I would like to find the first and last for each variable of a group
R) ex
state city code first.state last.state first.city last.city first.code last.code
1: az TU 85730 1 1 1 1 1 1
2: fl MI 33133 1 0 1 0 1 0
3: fl MI 33133 0 0 0 0 0 0
4: fl MI 33133 0 0 0 0 0 1
5: fl MI 33146 0 0 0 0 1 0
6: fl MI 33146 0 1 0 1 0 1
7: oh MI 45056 1 1 1 1 1 1
As far as I know data.table
cannot easily help for things like this because by="state,city,code"
would look at 4
triplets.
The only way I know would be to look for first/last.code in a by="state,city,code" then first/last.city in a by="state,city".
This is what I meant:
applyAll <- function(DT, by){
f<- function(n, vec){ return(vec[1:n]) }
by <- lapply(1:length(by), FUN=f, by)
out <- Reduce(f=firstLast, init=DT, x=by)
return(out)
}
firstLast <- function(DT, by){
addNames <- paste(c("first", "last"),by[length(by)], sep=".")
DT[DT[,list(IDX=.I[1]), by=by]$IDX, addNames[1]:=1]
DT[DT[,list(IDX=.I[.N]), by=by]$IDX, addNames[2]:=1]
return(DT);
}
Result by: applyAll(ex,c("state","city","code"))
but this would make NUMEROUS copies of DT
, my question is then, is there someting scheduled or already existing such that we cant get first/last by groups. (This is fairly vanilla for SAS
or kdb
or SQL
)
In SAS
:
data DT;
set ex;
by state city code;
if first.code then firstcode=1;
if last.code then lastcode=1;
if first.city then firstcity=1;
if last.city then lastcity=1;
if first.state then firststate=1;
if last.state then laststate=1;
run;
Upvotes: 1
Views: 588
Reputation: 59612
If this is the question :
For a set of columns (x,y,z) I'd like to add an integer column marking the position of the first item of each group
by="x"
,by="x,y"
andby="x,y,z"
(three new columns). The 1st row of each new column will always be 1 because that's always the first item of the first group. I'd also like to add a further 3 columns marking the last item by each of the same 3 groupings. I might have many more than just 3 groupings, though, so is something programatic possible please?
then how about :
ex=data.table(state=c("az","fl","fl","fl","fl","fl","oh"),
city=c("TU","MI","MI","MI","MI","MI","MI"),
code=c(85730,33133,33133,33133,33146,33146,45056))
ex
state city code
1: az TU 85730
2: fl MI 33133
3: fl MI 33133
4: fl MI 33133
5: fl MI 33146
6: fl MI 33146
7: oh MI 45056
cols = c("state","city","code")
for (i in seq_along(cols)) {
ex[,paste0("f.",cols[i]):=c(1L,rep(0L,.N-1L)),by=eval(head(cols,i))] # first
ex[,paste0("l.",cols[i]):=c(rep(0L,.N-1L),1L),by=eval(head(cols,i))] # last
}
ex
state city code f.state l.state f.city l.city f.code l.code
1: az TU 85730 1 1 1 1 1 1
2: fl MI 33133 1 0 1 0 1 0
3: fl MI 33133 0 0 0 0 0 0
4: fl MI 33133 0 0 0 0 0 1
5: fl MI 33146 0 0 0 0 1 0
6: fl MI 33146 0 1 0 1 0 1
7: oh MI 45056 1 1 1 1 1 1
But as @Roland commented, there's probably a better way to achieve your ultimate goal.
And, as requested, here's what should be a faster solution using .I
and .N
:
cols = c("state","city","code")
for (i in seq_along(cols)) {
w = ex[,list(f=.I[1],l=.I[.N]),by=eval(head(cols,i))]
ex[,paste0(c("f.","l."),cols[i]):=0L] # add the two 0 columns
ex[w$f,paste0("f.",cols[i]):=1L] # mark the firsts
ex[w$l,paste0("l.",cols[i]):=1L] # mark the lasts
}
It should be faster because the grouping is done just once per column, and lots of small vectors are not created (no call to c()
or rep()
for each group) unlike the first solution.
Upvotes: 5
Reputation: 179448
It's not entirely clear what you want, but you can certainly have more than one column in the index:
ex[, list(first=head(code, 1), last=tail(code, 1)), by=c("state", "city")]
state city first last
1: az TU 85730 85730
2: fl MI 33133 33146
3: oh MI 45056 45056
You can automate this over your groups like this:
by <- c("state", "city", "code")
byList <- lapply(seq_along(by), function(i)by[sequence(i)])
lapply(byList,
function(i) ex[, list(first=head(code, 1), last=tail(code, 1)), by=i] )
[[1]]
state first last
1: az 85730 85730
2: fl 33133 33146
3: oh 45056 45056
[[2]]
state city first last
1: az TU 85730 85730
2: fl MI 33133 33146
3: oh MI 45056 45056
[[3]]
state city code first last
1: az TU 85730 85730 85730
2: fl MI 33133 33133 33133
3: fl MI 33146 33146 33146
4: oh MI 45056 45056 45056
Upvotes: 2