Reputation: 1
I have population data with age and gender characteristics, and I'm trying to populate another column with employment type based on other data I have. I've used 'sample' to select a sample of the population who work part time, and I will then add this data as a new column. However, I have yet to figure out how to ensure that those already selected are not selected again in the next sample for a different employment type.
At the moment I have the following, which selects 23% of males in a certain age group:
PT=my.df[sample(which(my.df$Age=="15" & my.df$Gender=="Male"), round(0.23*length (which(my.df$Age=="15" & my.df$Gender=="Male")))),]
And an example of my output looks like this:
Edinburgh.ID Age Gender
2445 2445 15 Male
2477 2477 15 Male
2469 2469 15 Male
2485 2485 15 Male
2487 2487 15 Male
2483 2483 15 Male
I now want to select the next x% from the same age and gender group, who have a different employment type. If I just change the 0.23 to another percentage, some of the same IDs come out again, but I want each ID to appear in only one sample.
Upvotes: 0
Views: 4860
Reputation: 532
You could define a data.frame describing the employment statistics for a given group and sample from it. Here is an approach in base R.
# Generate some data
N = 1000
my.df <- data.frame(Age = rep("15", N),
                    Gender = sample(c("Male", "Female"), N, TRUE),
                    Activity = rep("", N),
                    stringsAsFactors = FALSE)
head(my.df)
# Age Gender Activity
# 1 15 Female
# 2 15 Male
# 3 15 Male
# 4 15 Female
# 5 15 Male
# 6 15 Female
# employment statistics for the group age = "15" and gender = "Male"
employment <- data.frame(activity = letters[1:5],
                         prob = c(0.1, 0.1, 0.2, 0.5, 0.1),
                         stringsAsFactors = FALSE)
employment
# activity prob
# 1 a 0.1
# 2 b 0.1
# 3 c 0.2
# 4 d 0.5
# 5 e 0.1
# Assign activities
set.seed(35)
id <- which(my.df$Age == "15" & my.df$Gender == "Male")
my.df[id, "Activity"] <- sample(employment$activity, length(id),
                                replace = TRUE, prob = employment$prob)
table(my.df[my.df$Gender=="Male", "Activity"])/length(id)
# a b c d e
# 0.1135903 0.1054767 0.1805274 0.4665314 0.1338742
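The proportions above only approximate the target probabilities, because each person's activity is drawn independently. If exact shares are needed, one possible variation (a sketch, not part of the approach above) is to shuffle the group's row indices once and label consecutive chunks of the shuffle, so nobody can be picked twice. The data frame `pop`, its `ID` column, and the three shares below are made-up assumptions for illustration.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for the real population data
pop <- data.frame(ID = 1:200,
                  Age = "15",
                  Gender = rep(c("Male", "Female"), each = 100),
                  Activity = NA_character_,
                  stringsAsFactors = FALSE)

# Assumed employment shares for 15-year-old males (should sum to at most 1)
shares <- c(part_time = 0.23, full_time = 0.30, unemployed = 0.47)

id <- which(pop$Age == "15" & pop$Gender == "Male")

# Shuffle the row indices once; indexing by position avoids the
# sample(x) pitfall when x happens to have length 1
shuffled <- id[sample.int(length(id))]

# Turn shares into counts and check that rounding did not overshoot
counts <- round(shares * length(id))
stopifnot(sum(counts) <= length(id))

# Label consecutive chunks of the shuffled indices, one chunk per activity
labels <- rep(names(shares), times = counts)
pop$Activity[shuffled[seq_along(labels)]] <- labels

table(pop$Activity[id])
```

Because every index appears exactly once in `shuffled`, each person receives at most one label, and the group counts match the rounded shares.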
Upvotes: 0
Reputation: 1687
The dplyr package lets you randomly sample a fraction of the rows, with or without replacement.
library('dplyr')
sample_frac(df, size = percentage, replace = FALSE)
Here percentage is a fraction between 0 and 1 (for example 0.23). You can then filter on age and gender to restrict the sample to the group you need.
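To keep a second sample from overlapping the first, one option is to remove the rows already drawn before sampling again. The following is a minimal sketch of that idea; the data frame `people`, the unique `ID` column, and the 23% and 30% shares are assumptions made for illustration.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical data; ID is assumed to be a unique key
people <- data.frame(ID = 1:200,
                     Age = "15",
                     Gender = rep(c("Male", "Female"), each = 100),
                     stringsAsFactors = FALSE)

group <- filter(people, Age == "15", Gender == "Male")

# First sample: 23% of the group, without replacement
part_time <- sample_frac(group, size = 0.23, replace = FALSE)

# Drop the rows already drawn, then sample from what is left.
# The count is computed from the full group so that 30% means
# 30% of all 15-year-old males, not 30% of the remainder.
remaining <- anti_join(group, part_time, by = "ID")
full_time <- sample_n(remaining, size = round(0.30 * nrow(group)))

# Number of IDs shared by the two samples
length(intersect(part_time$ID, full_time$ID))
```

Since `full_time` is drawn only from `remaining`, the two samples cannot share an ID.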
Upvotes: 2