WGray
WGray

Reputation: 307

R dataframe - how to extract unique values

I am trying to extract the first record for each combination of SUBJ, BLK and TR from a dataset. The simplified set, sampleData, is:

SUBJ BLK TR    BEG    END
1  1234   1  1 111021 111021
2  1234   1  1 111400 111021
3  1234   1  1 111566 111021
4  1234   1  1 111765 111021
5  1234   2  2 132050 133113
6  1234   2  2 133123 133113
7  1234   2  2 133479 133113
8  1234   2  2 133762 133113
9  5678   1  1  82503  82502
10 5678   1  1  82902  82502
11 5678   1  1  83102  82502
12 5678   1  1  83310  82502
13 5678   2  2 274870 288224
14 5678   2  2 288225 288224
15 5678   2  2 288535 288224
16 5678   2  2 288802 288224

When I try this:

ddplyFirst <- ddply(sampleData, .(SUBJ, BLK, TR), summarize, 
                        Tr.match = match(unique(TR), TR))

I get this:

ddplyFirst
  SUBJ BLK TR Tr.match
1 1234   1  1        1
2 1234   2  2        1
3 5678   1  1        1
4 5678   2  2        1

I don't know how to go from this to what I want to get, which would include either the BEG and END values or BEG - END.

now, it turns out that in the above example, END is a unique number so I can do this:

first <- with(sampleData, match(unique(END), END))

which gives me:

sampleData[first,]
   SUBJ BLK TR    BEG    END
1  1234   1  1 111021 111021
5  1234   2  2 132050 133113
9  5678   1  1  82503  82502
13 5678   2  2 274870 288224

The problem is that the complete dataset is 202,616 records long and I cannot guarantee that BEG or END have unique values for different combinations of SUBJ, BLK, and TR.

Also, I would like to learn how to solve the more general case, which for me right now would be obtaining the first record in each SUBJ, BLK, and TR combination.

Of course, the even more general case would be to obtain the nth record in each SUBJ, BLK, and TR combo. If anyone can show me how to do one or the other of these "more" general or "most" general solutions, I would be vastly appreciative.

Upvotes: 5

Views: 5145

Answers (5)

mnel
mnel

Reputation: 115392

A data.table solution for coding elegance (and general efficiency)

If you are getting any row other than the first (and there may be differing numbers rows in each sub group), then you will need to consider what happens when you are trying to get a row that doesn't exist.

The solution below will get the min(`n`, `total#of rows in each group`)) for each group

library(data.table)
DT <- data.table(sampleData)

# get the the row you want. This may 
#
index <- 2L
 DT[,{ idx <- min(index, .N); .SD[idx,]} ,by =list(SUBJ, BLK, TR)] 

Upvotes: 1

Blue Magister
Blue Magister

Reputation: 13363

In the specific case (getting the first or last row) you can use the function duplicated:

sampleData[!duplicated(sampleData[,c("SUBJ","BLK","TR")],fromLast=FALSE),]

Upvotes: 1

Matthew Lundberg
Matthew Lundberg

Reputation: 42649

Here is an aggregate solution for the general case:

aggregate(. ~ SUBJ+BLK+TR, data=sampleData, FUN='[', 1)
##   SUBJ BLK TR    BEG    END
## 1 1234   1  1 111021 111021
## 2 5678   1  1  82503  82502
## 3 1234   2  2 132050 133113
## 4 5678   2  2 274870 288224


aggregate(. ~ SUBJ+BLK+TR, data=sampleData, FUN='[', 2)
##   SUBJ BLK TR    BEG    END
## 1 1234   1  1 111400 111021
## 2 5678   1  1  82902  82502
## 3 1234   2  2 133123 133113
## 4 5678   2  2 288225 288224

Upvotes: 0

agstudy
agstudy

Reputation: 121568

No need to use summarise here, you can do this (using head to get the first row)

 ddply(sampleData, .(SUBJ, BLK, TR), function(x) head(x,1))
  SUBJ BLK TR    BEG    END
1 1234   1  1 111021 111021
2 1234   2  2 132050 133113
3 5678   1  1  82503  82502
4 5678   2  2 274870 288224

Or more general to get the nth row.n you can do :

ddply(sampleData, .(SUBJ, BLK, TR), function(x) x[min(row.n,nrow(x),])

Upvotes: 3

Arun
Arun

Reputation: 118799

You can access 1st or nth record by just indexing it as such (assuming nth index exists for each group):

idx <- 2
ddply(sampleData, .(SUBJ, BLK, TR), summarise, BEG=BEG[idx], END=END[idx])

#   SUBJ BLK TR    BEG    END
# 1 1234   1  1 111400 111021
# 2 1234   2  2 133123 133113
# 3 5678   1  1  82902  82502
# 4 5678   2  2 288225 288224

Upvotes: 2

Related Questions