add-semi-colons

Reputation: 18830

R: replace empty values in a data frame column with random categorical values

I am trying to fill in missing demographic values by assigning them randomly.

I can get the rows with an empty gender value with the following:

df[df$gender == "", ]

user_id, name, age, gender
001, xyz, 23,  
004, abc, 32, 

I want to assign gender randomly:

sample(c('male', 'female'), nrow(df$gender[df$gender == ""]), replace=TRUE, prob=c(0.5, 0.5))

I tried the following:

df$gender[df$gender == ""] <- sample(c('male', 'female'), nrow(df$gender[df$gender == ""]), replace=TRUE, prob=c(0.5, 0.5))

This assigned values to only a few cells, not all of them.

Upvotes: 3

Views: 569

Answers (2)

Sastibe

Reputation: 248

Using the following example:

user_id <- c(1:5)
name <- c("a","b","c","d","e")
age <- c(20,23,44,21,32)
gender <- c("m","f","","", "m")

df <- data.frame(user_id,
                 name,
                 age,
                 gender,
                 stringsAsFactors = FALSE)

I suggest creating a vector of random values of length nrow(df):

rand_gender <- sample(c('m', 'f'), nrow(df), replace=TRUE, prob=c(0.5, 0.5))

And then replacing only where gender is empty:

df$gender <- ifelse(df$gender=="", rand_gender, df$gender)
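A quick way to check the result is sketched below. It rebuilds the example data from this answer, applies the same ifelse replacement, and confirms that no empty strings remain and that the existing values are untouched. The set.seed call and the was_empty helper variable are assumptions added for illustration; they are not part of the original code.

```r
# Rebuild the example data used in this answer
df <- data.frame(user_id = 1:5,
                 name = c("a", "b", "c", "d", "e"),
                 age = c(20, 23, 44, 21, 32),
                 gender = c("m", "f", "", "", "m"),
                 stringsAsFactors = FALSE)

set.seed(1)  # assumption: fixed seed so the random draw is repeatable

# Remember which rows were empty before replacing (hypothetical helper)
was_empty <- df$gender == ""

# One random value per row; only the ones at empty positions are used
rand_gender <- sample(c("m", "f"), nrow(df), replace = TRUE)
df$gender <- ifelse(was_empty, rand_gender, df$gender)

# No empty strings should remain, and existing values should be unchanged
stopifnot(!any(df$gender == ""))
stopifnot(identical(df$gender[!was_empty], c("m", "f", "m")))

# Frequency of each category after filling
table(df$gender)
```

Drawing nrow(df) values and discarding most of them is slightly wasteful on large data, but it keeps the replacement a single vectorised step.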

Upvotes: 3

Matt W.

Reputation: 3722

You should use length instead of nrow. df$gender[df$gender == ""] returns a vector, since you're subsetting df$gender, and nrow returns NULL for a plain vector. You also don't need prob = c(0.5, 0.5): by default sample gives every element the same probability, which is 50/50 when you only give it two options. You would use prob if you wanted, say, a 70/30 split for male/female.

df$gender[df$gender == ""] <- sample(c('male', 'female'), length(df$gender[df$gender == ""]), replace=TRUE)
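As a minimal sketch of both points, the snippet below shows the difference between nrow and length on a subsetted column, then fills the empty entries with a weighted draw. The small data frame, the empty index variable, the seed, and the 70/30 weights are illustrative assumptions, not values from the question.

```r
# Hypothetical example data with two empty gender entries
df <- data.frame(user_id = 1:5,
                 gender = c("male", "female", "", "", "male"),
                 stringsAsFactors = FALSE)

# Logical index of the rows to fill (hypothetical helper variable)
empty <- df$gender == ""

nrow(df$gender[empty])    # NULL: a plain vector has no dim attribute
length(df$gender[empty])  # number of empty entries
sum(empty)                # same count, without subsetting first

set.seed(123)  # assumption: fixed seed so the draw is repeatable

# Weighted draw: roughly 70% "male", 30% "female" over many draws
df$gender[empty] <- sample(c("male", "female"), sum(empty),
                           replace = TRUE, prob = c(0.7, 0.3))

stopifnot(!any(df$gender == ""))
```

Storing the logical index once avoids evaluating df$gender == "" twice, and sum(empty) gives the number of values to draw directly.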

Upvotes: 1
