Reputation: 13
I have a final dataset of roughly 150,000 rows by 40 columns that covers all my potential samples from 1932 to 2016, and I need to make a random selection of 53 samples per year, for a total of ~5000.
The selection itself is straightforward using the sample() function to get a subset. However, I need to display the selection in the original dataframe to be able to check various things. My issue is the following:
If I edit one of the fields in my random subset and merge it back with the main one, it creates duplicates that I can't remove, because one field changed and R therefore no longer considers the two rows duplicates. If I don't edit anything, I can't find which rows were selected.
My solution for now was to merge everything in Excel instead of R, apply color codes to highlight the selected rows, and manually delete the duplicates. However, this is time-consuming, prone to mistakes, and not practicable: the dataset seems to be too big, and my PC quickly runs out of memory when I try...
UPDATE:
Here's a reproducible example:
dat <- data.frame(
  X = sample(2000:2016, 50, replace = TRUE),
  Y = sample(c("yes", "no"), 50, replace = TRUE),
  Z = sample(c("french", "german", "english"), 50, replace = TRUE)
)
dat2 <- subset(dat, X == 2000)       # samples from year 2000
sc <- dat2[sample(nrow(dat2), 1), ]  # random selection of 1 row
What I would like to do is mark the selection directly in the dataset (dat), for example by randomly assigning the value "1" in a column called "selection". If that is not possible, how can I merge the sampled rows (here called "sc") back into the main dataset with something indicating that they have been sampled?
Note:
I've been using R sporadically for the last 2 years and I'm a fairly inexperienced user, so I apologize if this is a silly question. I've been roaming Google and SO for the last 3 days and couldn't find a relevant answer.
I recently got into a PhD program in biology that requires me to handle a lot of data from an archive.
Upvotes: 0
Views: 92
Reputation: 25395
EDIT: updated based on comments.
You could add a column that indicates whether a row is part of your sample. So maybe try the following:
library(dplyr)

df <- data.frame(
  year = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  id   = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  age  = c(7, 7, 7, 12, 12, 12, 7, 7, 7, 12, 12, 12)
)
n_per_year_low_age  <- 2  # rows to draw per year with age < 8
n_per_year_high_age <- 1  # rows to draw per year with age > 8

# Caution: sample(x, n) treats a length-one numeric x as 1:x,
# so each age group needs at least two ids per year here.
df <- df %>%
  group_by(year) %>%
  mutate(in_sample1 = as.numeric(id %in% sample(id[age < 8], n_per_year_low_age))) %>%
  mutate(in_sample2 = as.numeric(id %in% sample(id[age > 8], n_per_year_high_age))) %>%
  mutate(in_sample = in_sample1 + in_sample2) %>%
  select(-in_sample1, -in_sample2)
Output:
# A tibble: 12 x 4
# Groups:   year [2]
    year    id   age in_sample
   <dbl> <dbl> <dbl>     <dbl>
 1  1.00  1.00  7.00      1.00
 2  1.00  2.00  7.00      1.00
 3  1.00  3.00  7.00      0
 4  1.00  4.00 12.0       1.00
 5  1.00  5.00 12.0       0
 6  1.00  6.00 12.0       0
 7  2.00  7.00  7.00      1.00
 8  2.00  8.00  7.00      0
 9  2.00  9.00  7.00      1.00
10  2.00 10.0  12.0       0
11  2.00 11.0  12.0       0
12  2.00 12.0  12.0       1.00
Further operations are then trivial:
# extracting your sample
df %>% filter(in_sample==1)
# comparing statistics of your sample against the rest of the population
df %>% group_by(year,in_sample) %>% summarize(mean(id))
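If you don't need the age split, the same idea can probably be done in base R on the example from the question. This is only a minimal sketch under a few assumptions: the column name selection comes from the question, while n_per_year, the fixed seed, and the min() guard for years with too few rows are my own additions for illustration.

```r
# Toy data in the shape of the question's example (X is the year).
set.seed(42)  # assumption: fixed seed so the selection is reproducible
dat <- data.frame(
  X = sample(2000:2016, 200, replace = TRUE),
  Y = sample(c("yes", "no"), 200, replace = TRUE),
  Z = sample(c("french", "german", "english"), 200, replace = TRUE)
)

n_per_year <- 3  # hypothetical value; the question would use 53

# Split the row numbers by year and draw from each group.
# Sampling positions via seq_along() avoids the sample() pitfall
# when a year has only a single row.
picked <- unlist(lapply(split(seq_len(nrow(dat)), dat$X), function(rows) {
  rows[sample(seq_along(rows), min(n_per_year, length(rows)))]
}))

# Flag the drawn rows in the original data frame; no merge needed.
dat$selection <- 0L
dat$selection[picked] <- 1L

# Check: number of selected (1) and unselected (0) rows per year
table(dat$X, dat$selection)
```

Because the flag is written into dat itself, no rows are duplicated, and dat[dat$selection == 1, ] gives the sample back whenever you need it.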
Upvotes: 1