lts

Reputation: 1

Random sample a percentage of rows without repetition in R

I have population data with age and gender characteristics, and I'm trying to populate another column with employment type based on other data I have. I've used 'sample' to select a subset of the population who work part time, and I will then add this as a new column. However, I have yet to figure out how to ensure that those selected are not selected again in the next sample for a different employment type.

At the moment I have the following, which selects 23% of males in a certain age group:

idx <- which(my.df$Age == "15" & my.df$Gender == "Male")
PT <- my.df[sample(idx, round(0.23 * length(idx))), ]

And an example of my output looks like this:

         Edinburgh.ID    Age    Gender
2445         2445        15      Male
2477         2477        15      Male
2469         2469        15      Male
2485         2485        15      Male
2487         2487        15      Male
2483         2483        15      Male

I now want to select the next x% from the same age and gender group for a different employment type. If I just change the 0.23 to another percentage, some of the same IDs come out again, but I want each ID to appear in only one sample.

Upvotes: 0

Views: 4860

Answers (2)

Duf59

Reputation: 532

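You could define a data.frame describing the employment statistics for a given group and sample activities from it. Because every row receives exactly one activity, no individual can end up in two employment types; the resulting proportions are approximate rather than exact. Here is an approach in base R.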

# Generate some data
N <- 1000
my.df <- data.frame(Age = rep("15", N),
                    Gender = sample(c("Male", "Female"), N, TRUE),
                    Activity = rep("", N),
                    stringsAsFactors = FALSE)
head(my.df)
# Age Gender Activity
# 1  15 Female         
# 2  15   Male         
# 3  15   Male         
# 4  15 Female         
# 5  15   Male         
# 6  15 Female        

# employment statistics for the group age = "15" and gender = "Male"
employment <- data.frame(activity = letters[1:5],
                         prob = c(0.1, 0.1, 0.2, 0.5, 0.1),
                         stringsAsFactors = FALSE)
employment
# activity prob
# 1        a  0.1
# 2        b  0.1
# 3        c  0.2
# 4        d  0.5
# 5        e  0.1

# Assign activities
set.seed(35)
id   <- which(my.df$Age == "15" & my.df$Gender == "Male")
my.df[id, "Activity"] <- sample(employment$activity, length(id),
                      replace = TRUE, prob =  employment$prob)

table(my.df[my.df$Gender=="Male", "Activity"])/length(id)
# a         b         c         d         e 
# 0.1135903 0.1054767 0.1805274 0.4665314 0.1338742 
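
If exact group sizes are needed rather than approximate proportions, one alternative is to shuffle the matching row indices once and label consecutive blocks of the shuffle. The following is only a sketch under assumptions: the toy data, the `shares` values and the activity names are made up for illustration and do not come from the question.

```r
# Toy data (assumed for illustration): 100 males and 100 females aged "15".
set.seed(1)
my.df <- data.frame(Edinburgh.ID = 1:200,
                    Age = "15",
                    Gender = rep(c("Male", "Female"), each = 100),
                    Activity = NA_character_,
                    stringsAsFactors = FALSE)

id <- which(my.df$Age == "15" & my.df$Gender == "Male")

# Hypothetical target shares per employment type (should sum to at most 1).
shares <- c(PartTime = 0.23, FullTime = 0.50, Unemployed = 0.27)

# Shuffle the row indices once, so each row appears exactly once.
shuffled <- id[sample.int(length(id))]

# Turn shares into counts and build one label per selected row.
counts <- round(shares * length(id))
labels <- rep(names(shares), times = counts)

# Guard against rounding producing more labels than rows.
n <- min(length(labels), length(shuffled))
my.df$Activity[shuffled[seq_len(n)]] <- labels[seq_len(n)]

table(my.df$Activity[id])
```

Since every index occurs once in `shuffled`, no ID can receive two employment types. Indexing with `sample.int(length(id))` also avoids the surprise that `sample(x, ...)` treats a single number `x` as `1:x`.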

Upvotes: 0

gented

Reputation: 1687

The dplyr package can randomly sample a given fraction of rows, with or without replacement.

library('dplyr')
sample_frac(df, size = percentage, replace = FALSE)

You can then filter on age and gender to restrict the sample to the group you need.
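
To keep successive samples from overlapping, one possible approach is to drop the rows already drawn before sampling again. The sketch below assumes toy data with an `Edinburgh.ID` column as in the question; the 50% figure for the second group is invented for illustration. In recent dplyr versions `sample_frac()` and `sample_n()` are superseded by `slice_sample()`, but they still work.

```r
library(dplyr)

# Toy data (assumed for illustration): 100 males and 100 females aged "15".
set.seed(1)
my.df <- data.frame(Edinburgh.ID = 1:200,
                    Age = "15",
                    Gender = rep(c("Male", "Female"), each = 100),
                    stringsAsFactors = FALSE)

# Restrict to the age and gender group of interest.
group <- filter(my.df, Age == "15", Gender == "Male")

# First sample: 23% of the group, without replacement.
PT <- sample_frac(group, size = 0.23, replace = FALSE)

# Remove the rows already chosen, then sample from what is left.
remaining <- anti_join(group, PT, by = "Edinburgh.ID")

# Second sample: 50% of the original group, expressed as a row count
# so the percentage still refers to the whole group, not the remainder.
FT <- sample_n(remaining, size = round(0.50 * nrow(group)))

# IDs common to both samples (should be empty).
intersect(PT$Edinburgh.ID, FT$Edinburgh.ID)
```

The second draw uses a count rather than a fraction because a fraction of `remaining` would no longer correspond to the intended share of the full group.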

Upvotes: 2
