Reputation: 55
I've been practicing with the Titanic dataset and have made steady progress. However, I got stuck when trying to replace the missing 'Age' values with the median age, depending on certain conditions. Here is the dataset:
| Pclass | Name | Sex | Age |
|:------:|:---------:|:------:|:---:|
| 2 | officer | male | NA |
| 3 | mr | male | 27 |
| 3 | miss | female | NA |
Now, I want to replace the NAs with the median ages that I calculated and leave the ages that are already present as they are. For this I used the following code for iteration:
age_fill <- function(x){
  for (i in length(x$Age)) {
    if (!is.na(x$Age[i])) {
      return(x$Age[i])
    }
    else if(is.na(x$Age[i])){
      if (x$Sex[i] == "female" && x$Pclass[i] == "3" && x$Name[i] == "miss"){
        x$Age[i] = 18
      }
      if (x$Sex[i] == "male" && x$Pclass[i] == "2" && x$Name[i] == "mr"){
        x$Age[i] = 29
      }
      if (x$Sex[i] == "male" && x$Pclass[i] == "3" && x$Name[i] == "officer"){
        x$Age[i] = 25
      }
    }
  }
  return(x)
}
The problem is that nothing changes when I run the code as a function or in a loop. However, if I run it separately for a single row by putting in the row number, it returns the expected result.
Can someone please tell me what I'm doing wrong?
Upvotes: 0
Views: 44
Reputation: 16988
Regarding your function there are several issues:
age_fill <- function(x){
  for (i in length(x$Age)) {
    if (!is.na(x$Age[i])) {
      return(x$Age[i])
    }
    # some more code
  }
}
Your `for` loop runs only once: `length(x$Age)` returns a single value (the number of rows), so `i` only ever takes that one value and only the last row is visited. I guess you mistook it for `1:length(x$Age)`.
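To make the difference visible, here is a minimal sketch with a made-up vector `ages` (not taken from the question). It uses `seq_along()`, which does the same job as `1:length()` here and is often preferred because it also behaves sensibly for empty vectors.

```r
# Hypothetical example vector, only for illustration
ages <- c(NA, 27, NA)

# length(ages) is the single number 3, so the body runs once, with i = 3
visited_length <- c()
for (i in length(ages)) {
  visited_length <- c(visited_length, i)
}

# seq_along(ages) is 1, 2, 3, so the body runs once per element
visited_seq <- c()
for (i in seq_along(ages)) {
  visited_seq <- c(visited_seq, i)
}

print(visited_length)  # only the last index
print(visited_seq)     # every index
```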
If your function encounters a non-NA value, `return(x$Age[i])` will stop your function and return that single value. I don't think that is what you want. In the case of a non-NA value you want your function not to change anything. Therefore you should remove this whole part:
if (!is.na(x$Age[i])) {
  return(x$Age[i])
}
else
Your condition
if(is.na(x$Age[i])){
  # enter code here
}
is sufficient.
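Putting both fixes together, a minimal sketch of the corrected loop could look like this. The fill values are the ones from your question; the small data frame `titanic_sample` and its column types are my assumption, only there to make the example runnable.

```r
# Sketch of the corrected function: visit every row, only touch NA ages
age_fill <- function(x) {
  for (i in seq_along(x$Age)) {
    if (is.na(x$Age[i])) {
      if (x$Sex[i] == "female" && x$Pclass[i] == 3 && x$Name[i] == "miss") {
        x$Age[i] <- 18
      } else if (x$Sex[i] == "male" && x$Pclass[i] == 2 && x$Name[i] == "mr") {
        x$Age[i] <- 29
      } else if (x$Sex[i] == "male" && x$Pclass[i] == 3 && x$Name[i] == "officer") {
        x$Age[i] <- 25
      }
    }
  }
  x  # return the whole modified data frame
}

# Hypothetical stand-in for the Titanic data (assumed column types)
titanic_sample <- data.frame(
  Pclass = c(2, 3, 3),
  Name   = c("officer", "mr", "miss"),
  Sex    = c("male", "male", "female"),
  Age    = c(NA, 27, NA),
  stringsAsFactors = FALSE
)

# R functions work on a copy, so the result has to be assigned
titanic_filled <- age_fill(titanic_sample)
print(titanic_filled)
```

Note that the first row (class 2, officer, male) stays NA, because none of the three conditions covers that combination.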
Here is a solution using `dplyr`. It's not a direct answer to your question, but I want to show you another approach to your problem. Given a dataset
> df
# A tibble: 6 x 4
Pclass Name Sex Age
<dbl> <chr> <chr> <dbl>
1 2 officer male NA
2 3 mr male 27
3 3 miss female NA
4 3 mr male NA
5 2 mr male NA
6 3 officer male NA
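that I created with the `readr` package (in readr 2.0 and later, `read_table2()` is deprecated in favour of `read_table()`):
library(readr)

df <- read_table2("Pclass Name Sex Age
2 officer male NA
3 mr male 27
3 miss female NA
3 mr male NA
2 mr male NA
3 officer male NA")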
Now we use `mutate` combined with `case_when`:
library(dplyr)

df %>%
  mutate(Age = case_when(
    !is.na(Age) ~ Age,  # keep ages that are already present
    Sex == "male" & Pclass == "3" & Name == "officer" ~ 25,
    Sex == "male" & Pclass == "2" & Name == "mr" ~ 29,
    Sex == "female" & Pclass == "3" & Name == "miss" ~ 18
    # rows matching no condition stay NA
  ))
which yields
# A tibble: 6 x 4
Pclass Name Sex Age
<dbl> <chr> <chr> <dbl>
1 2 officer male NA
2 3 mr male 27
3 3 miss female 18
4 3 mr male NA
5 2 mr male 29
6 3 officer male 25
Using this approach you need neither a function nor any kind of loop, and your conditions are clearly arranged. As a rule of thumb: try to avoid loops. Usually there are more idiomatic ways of performing a task without them, because R uses "hidden loops" inside vectorized functions that are optimized for performance. However, there are tasks that are well suited for loops, so the decision depends on the actual task.
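If you would rather stay in base R, the same idea can be sketched with logical indexing. This is only an illustration: the data frame is rebuilt by hand with `data.frame()` (an assumption on my part, mirroring the six example rows above), and each assignment fills all matching rows at once.

```r
# Example data, rebuilt by hand to mirror the six rows shown above
df <- data.frame(
  Pclass = c(2, 3, 3, 3, 2, 3),
  Name   = c("officer", "mr", "miss", "mr", "mr", "officer"),
  Sex    = c("male", "male", "female", "male", "male", "male"),
  Age    = c(NA, 27, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Compute the NA mask once so that existing ages are never overwritten
missing_age <- is.na(df$Age)

# Each line selects all matching rows in one vectorized step
df$Age[missing_age & df$Sex == "female" & df$Pclass == 3 & df$Name == "miss"] <- 18
df$Age[missing_age & df$Sex == "male" & df$Pclass == 2 & df$Name == "mr"] <- 29
df$Age[missing_age & df$Sex == "male" & df$Pclass == 3 & df$Name == "officer"] <- 25

print(df)
```

Rows that match none of the conditions keep their NA, just as in the `case_when` version.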
Upvotes: 3
Reputation: 306
I think the function takes a parameter `x` and returns `x`, but the for loop is applied to a (I guess) data.frame `comb`. To be able to call the function as
output <- age_fill(comb)
you should replace `comb$myVariable` with `x$myVariable`, so that all the operations inside the for loop are applied to the function's argument.
Upvotes: 1