Reputation: 307
I am trying to extract the first record for each combination of SUBJ
, BLK
and TR
from a dataset. The simplified set, sampleData, is:
SUBJ BLK TR BEG END
1 1234 1 1 111021 111021
2 1234 1 1 111400 111021
3 1234 1 1 111566 111021
4 1234 1 1 111765 111021
5 1234 2 2 132050 133113
6 1234 2 2 133123 133113
7 1234 2 2 133479 133113
8 1234 2 2 133762 133113
9 5678 1 1 82503 82502
10 5678 1 1 82902 82502
11 5678 1 1 83102 82502
12 5678 1 1 83310 82502
13 5678 2 2 274870 288224
14 5678 2 2 288225 288224
15 5678 2 2 288535 288224
16 5678 2 2 288802 288224
When I try this:
ddplyFirst <- ddply(sampleData, .(SUBJ, BLK, TR), summarize,
Tr.match = match(unique(TR), TR))
I get this:
ddplyFirst
SUBJ BLK TR Tr.match
1 1234 1 1 1
2 1234 2 2 1
3 5678 1 1 1
4 5678 2 2 1
I don't know how to go from this to what I want to get, which would include either the BEG
and END
values or BEG - END
.
now, it turns out that in the above example, END
is a unique number so I can do this:
first <- with(sampleData, match(unique(END), END))
which gives me:
sampleData[first,]
SUBJ BLK TR BEG END
1 1234 1 1 111021 111021
5 1234 2 2 132050 133113
9 5678 1 1 82503 82502
13 5678 2 2 274870 288224
The problem is that the complete dataset is 202,616 records long and I cannot guarantee that BEG
or END
have unique values for different combinations of SUBJ
, BLK
, and TR
.
Also, I would like to learn how to solve the more general case, which for me right now would be obtaining the first record in each SUBJ
, BLK
, and TR
combination.
Of course, the even more general case would be to obtain the nth record in each SUBJ
, BLK
, and TR
combo. If anyone can show me how to do one or the other of these "more" general or "most" general solutions, I would be vastly appreciative.
Upvotes: 5
Views: 5145
Reputation: 115392
A data.table
solution for coding elegance (and general efficiency)
If you are getting any row other than the first (and there may be differing numbers rows in each sub group), then you will need to consider what happens when you are trying to get a row that doesn't exist.
The solution below will get the min(`n`, `total#of rows in each group`))
for each group
library(data.table)
DT <- data.table(sampleData)
# get the the row you want. This may
#
index <- 2L
DT[,{ idx <- min(index, .N); .SD[idx,]} ,by =list(SUBJ, BLK, TR)]
Upvotes: 1
Reputation: 13363
In the specific case (getting the first or last row) you can use the function duplicated
:
sampleData[!duplicated(sampleData[,c("SUBJ","BLK","TR")],fromLast=FALSE),]
Upvotes: 1
Reputation: 42649
Here is an aggregate
solution for the general case:
aggregate(. ~ SUBJ+BLK+TR, data=sampleData, FUN='[', 1)
## SUBJ BLK TR BEG END
## 1 1234 1 1 111021 111021
## 2 5678 1 1 82503 82502
## 3 1234 2 2 132050 133113
## 4 5678 2 2 274870 288224
aggregate(. ~ SUBJ+BLK+TR, data=sampleData, FUN='[', 2)
## SUBJ BLK TR BEG END
## 1 1234 1 1 111400 111021
## 2 5678 1 1 82902 82502
## 3 1234 2 2 133123 133113
## 4 5678 2 2 288225 288224
Upvotes: 0
Reputation: 121568
No need to use summarise
here, you can do this (using head
to get the first row)
ddply(sampleData, .(SUBJ, BLK, TR), function(x) head(x,1))
SUBJ BLK TR BEG END
1 1234 1 1 111021 111021
2 1234 2 2 132050 133113
3 5678 1 1 82503 82502
4 5678 2 2 274870 288224
Or more general to get the nth row.n you can do :
ddply(sampleData, .(SUBJ, BLK, TR), function(x) x[min(row.n,nrow(x),])
Upvotes: 3
Reputation: 118799
You can access 1st or nth record by just indexing it as such (assuming nth
index exists for each group):
idx <- 2
ddply(sampleData, .(SUBJ, BLK, TR), summarise, BEG=BEG[idx], END=END[idx])
# SUBJ BLK TR BEG END
# 1 1234 1 1 111400 111021
# 2 1234 2 2 133123 133113
# 3 5678 1 1 82902 82502
# 4 5678 2 2 288225 288224
Upvotes: 2