Reputation: 13123
I wanted to randomly sample individual cells of a data set without replacement and thought it would be easy. It was not, and I could not find R code online to do it. Eventually I got the code below to work. It seems overly complex, but it does appear to work.
set.seed(1234)
n.samples <- 10
my.grid <- read.table(text = '
state county y2000 y2001 y2002 y2003 y2004 y2005 y2006
A A 5 10 15 20 25 30 35
A B 15 20 25 30 35 40 45
A C 45 40 35 30 25 20 15
A Q 1 2 3 4 5 6 7
B A 9 8 7 6 5 4 3
B B 90 91 92 93 94 95 96
B G 10 20 30 40 50 60 70
B H 100 200 300 400 500 600 700
C J 900 850 800 750 700 650 600
C K 2 4 6 8 10 12 14
C M 3 6 9 12 15 18 21
C P 50 45 40 35 30 25 20
', header = TRUE)
my.grid
population <- expand.grid(row = seq_len(nrow(my.grid)),
                          col = seq(3, ncol(my.grid)))
rows <- seq(1, nrow(population))
sample <- sample(rows, n.samples, replace=FALSE)
use.these <- population[sample,]
use.these
measurement <- rep(NA, nrow(use.these))
my.area <- my.grid[use.these[,1], c(1:2)]
my.year <- names(my.grid)[use.these[,2]]
for (i in seq_len(nrow(use.these))) {
  measurement[i] <- my.grid[use.these[i, 1], use.these[i, 2]]
}
my.samples <- data.frame(use.these, my.area, my.year, measurement)
my.samples
Output for my.samples:
row col state county my.year measurement
10 10 3 C K y2000 2
52 4 7 A Q y2004 5
50 2 7 A B y2004 35
51 3 7 A C y2004 25
69 9 8 C J y2005 650
81 9 9 C J y2006 600
1 1 3 A A y2000 5
18 6 4 B B y2001 91
79 7 9 B G y2006 70
39 3 6 A C y2003 30
Is there a better way, particularly in base R? I have heard of the sampling package. Since my code seems to work and I am only asking for possibly better approaches, perhaps I should not post this here, although it seems like a common and important topic. If this is not an appropriate post, I can remove it and place the code on my Wikipedia user page. Thank you for any suggestions.
Upvotes: 2
Views: 821
Reputation: 13123
When I attempted to use DWin's answer, I realized that it returned the correct measurement, but it did not seem to return the correct state, county or year. I modified DWin's code as follows, and it seems to return the same answers as the code in my original post. I debated with myself over whether to write a comment or post a second answer. I can delete this answer if others deem it appropriate after DWin reviews it.
set.seed(1234)
n.samples <- 10
my.grid <- read.table(text = '
state county y2000 y2001 y2002 y2003 y2004 y2005 y2006
A A 5 10 15 20 25 30 35
A B 15 20 25 30 35 40 45
A C 45 40 35 30 25 20 15
A Q 1 2 3 4 5 6 7
B A 9 8 7 6 5 4 3
B B 90 91 92 93 94 95 96
B G 10 20 30 40 50 60 70
B H 100 200 300 400 500 600 700
C J 900 850 800 750 700 650 600
C K 2 4 6 8 10 12 14
C M 3 6 9 12 15 18 21
C P 50 45 40 35 30 25 20
', header = TRUE)
my.grid
mat <- as.matrix(my.grid[,3:ncol(my.grid)])
mat
size <- length(mat)
picks <- sample(size, n.samples)
picks
# [1] 10 52 50 51 69 81 1 18 79 39
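# R stores matrices column by column. Subtract 1 before dividing so that a pick
# that is an exact multiple of nrow(mat) maps to the last row rather than row 0.
my.row <- (picks - 1) %% nrow(mat) + 1
my.column <- 2 + ((picks - 1) %/% nrow(mat) + 1)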
my.samples2 <- cbind(my.row, my.column, my.grid[my.row, 1:2], names(my.grid)[my.column], mat[picks])
names(my.samples2) <- c('row','column','state','county','year','measurement')
my.samples2
Gives:
row column state county year measurement
10 10 3 C K y2000 2
4 4 7 A Q y2004 5
2 2 7 A B y2004 35
3 3 7 A C y2004 25
9 9 8 C J y2005 650
9.1 9 9 C J y2006 600
1 1 3 A A y2000 5
6 6 4 B B y2001 91
7 7 9 B G y2006 70
3.1 3 6 A C y2003 30
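The row and column arithmetic above can also be delegated to base R's arrayInd(), which converts linear indices into row/column pairs for a given set of dimensions. The following is only a minimal sketch on a small stand-in matrix (the 4 x 3 matrix and the picks vector are made up for illustration, not taken from my.grid):

```r
# Hypothetical stand-in for the year columns; values run down the columns
mat <- matrix(1:12, nrow = 4,
              dimnames = list(NULL, c("y2000", "y2001", "y2002")))

# Linear indices to look up, including an exact multiple of nrow(mat)
picks <- c(4, 5, 12)

# arrayInd() returns one row per pick: column 1 = row index, column 2 = column index
rc <- arrayInd(picks, dim(mat))

rows  <- rc[, 1]
years <- colnames(mat)[rc[, 2]]
vals  <- mat[picks]

data.frame(row = rows, year = years, measurement = vals)
```

With my.grid, the row indices could then be used as my.grid[rows, 1:2] to recover state and county.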
Upvotes: 1
Reputation: 263481
Convert the year columns to a matrix and sample from 1:(rows * cols), i.e. 1:length(mat), to get a linear index. Use 1 + (idx %/% nrow(.)) as the indices into the rows of the original and 1 + (idx %% ncol(.)) as the index into the year names. Then you will not need "no steenking loops". This is a really good question to perhaps get you beyond the loop mindset.
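# my.grid is the data frame from the question; the year columns start at column 3
mat <- as.matrix(my.grid[, -(1:2)])
set.seed(1234)
n.samples <- 10
size <- length(mat)
picks <- sample(size, n.samples)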
picks
# [1] 10 52 50 51 69 81 1 18 79 39
cbind(my.grid[ 1+(picks %/% nrow(my.grid) ), 1:2] ,
names(my.grid)[-(1:2)][1+(picks %% 7)],
mat[picks])
#---------------------------------------
state county names(my.grid)[-(1:2)][1 + (picks%%7)] mat[picks]
1 A A y2003 2
5 B A y2003 5
5.1 B A y2001 35
5.2 B A y2002 25
6 B B y2006 650
7 B G y2004 600
1.1 A A y2001 5
2 A B y2004 91
7.1 B G y2002 70
4 A Q y2004 30
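A possible way to sidestep the index arithmetic altogether is to reshape the data to long format, with one row per state/county/year cell, and then sample rows. The sketch below uses a made-up three-row miniature of my.grid (the names grid, long and picked are hypothetical); it relies on as.vector() unrolling a matrix column by column, which is how R stores matrices.

```r
# Hypothetical miniature of my.grid: two id columns followed by year columns
grid <- data.frame(state  = c("A", "A", "B"),
                   county = c("A", "B", "A"),
                   y2000  = c(5, 15, 9),
                   y2001  = c(10, 20, 8))

mat <- as.matrix(grid[, 3:ncol(grid)])

# One row per cell. as.vector(mat) walks down each column in turn, so the id
# columns are repeated once per year and each year name is repeated nrow(mat) times.
long <- data.frame(state       = rep(grid$state,  times = ncol(mat)),
                   county      = rep(grid$county, times = ncol(mat)),
                   year        = rep(colnames(mat), each = nrow(mat)),
                   measurement = as.vector(mat))

set.seed(1234)
# sample() draws without replacement by default
picked <- long[sample(nrow(long), 4), ]
picked
```

Each sampled row then carries its own state, county and year, so no lookup back into the original table is needed.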
Upvotes: 2