Sarah Tomori

Reputation: 13

Trouble calculating RMSE in R

I am currently working on a data science project based on the MovieLens data set.

I have split the data into training and test sets like so:

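# Packages the code below relies on
library(caret)      # createDataPartition()
library(tidyverse)  # tibble(), %>%, group_by(), summarize(), qplot()

# Test set will be 10% of current MovieLens data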
set.seed(1, sample.kind="Rounding")
# if using R 3.5 or earlier, use `set.seed(1)` instead
test_index2 <- createDataPartition(y = edx$rating, times = 1, p = 0.1, list = FALSE)
train_set <- edx[-test_index2,]
test_set <- edx[test_index2,]

I have to calculate RMSE for the predicted ratings based on this function:

#Define the function that calculates RMSE
RMSE <- function(true_ratings, predicted_ratings){
  sqrt(mean((true_ratings - predicted_ratings)^2))
}
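
As a quick sanity check of the function above, here is a minimal sketch on made-up numbers (the vectors are illustrative, not MovieLens ratings). It also shows that a single predicted value is recycled against every true rating, which is what a constant model such as an overall mean relies on.

```r
# RMSE as defined above
RMSE <- function(true_ratings, predicted_ratings){
  sqrt(mean((true_ratings - predicted_ratings)^2))
}

# Made-up ratings, for illustration only
true_toy <- c(4, 3, 5)
pred_toy <- c(4, 3, 3)

# Squared errors are 0, 0, 4, so the RMSE is sqrt(4/3)
RMSE(true_toy, pred_toy)

# A single prediction is recycled across all ratings:
# squared errors are 0, 1, 1, so the RMSE is sqrt(2/3)
RMSE(true_toy, 4)
```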

First, I do this with the simplest model, which looks like this:

#Get mu_hat with the simplest model
mu_hat <- mean(train_set$rating)
mu_hat
[1] 3.512457

#Predict the known ratings with mu_hat
naive_rmse <- RMSE(test_set$rating, mu_hat)
naive_rmse
[1] 1.060056

#Create the results table
rmse_results <- tibble(method = "Simple average model", RMSE = naive_rmse)

Next, I need to use a model that penalizes for the movie effects:

#Penalize movie effects and adjust the mean
b_i <- train_set %>% group_by(movieId) %>%
  summarize(b_i = sum(rating - mu_hat)/(n() + 1))

#Save and plot the movie averages with the movie effect model
movie_effect_avgs <- train_set %>% group_by(movieId) %>% summarize(b_i = mean(rating - mu_hat))
movie_effect_avgs %>% qplot(b_i, geom = "histogram", bins = 10, data = ., color = I("azure3"), xlab = "b_i", ylab = "Number of movies")

#Save the new predicted ratings
predicted_ratings <- mu_hat + test_set %>% left_join(movie_effect_avgs, by='movieId') %>%
  pull(b_i)

The first line of the predicted ratings looks like this:

predicted_ratings
   [1] 3.130763 4.221028 3.742687 3.429529 3.999581 4.278903 3.167818 3.332393

My problem occurs here:

#Calculate the RMSE for the movie effect model
movie_effect_rmse <- RMSE(predicted_ratings, test_set$rating)
movie_effect_rmse
[1] NA

It simply returns NA instead of an RMSE value for the second model, and I cannot see what is wrong with my code or why the RMSE function doesn't work. I suspect it has something to do with the structure of the test and training sets. The code works if I follow exactly the same steps but skip the further split: that is, if I train on the full edx data set and predict directly on the validation set. However, the project instructions do not allow this.

Any suggestions on what could be wrong?

Upvotes: 0

Views: 566

Answers (1)

Fnguyen

Reputation: 1177

Just to codify this as an answer: functions that return NA usually do so because some of their inputs are already NA.
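
A likely source of the NAs here (an assumption, since the data are not shown) is that some movies in test_set never appear in train_set, so the left join has no b_i to attach to them. A minimal sketch with toy data frames, using base R merge(all.x = TRUE) as a stand-in for left_join; all names ending in _toy are hypothetical:

```r
# Toy data, for illustration only: movie 3 is in the test set
# but not in the training set
train_toy <- data.frame(movieId = c(1, 1, 2), rating = c(4, 5, 2))
test_toy  <- data.frame(movieId = c(1, 2, 3), rating = c(5, 3, 4))

mu_toy <- mean(train_toy$rating)

# Per-movie average deviation from the overall mean
b_toy <- aggregate(rating ~ movieId, data = train_toy,
                   FUN = function(r) mean(r - mu_toy))
names(b_toy)[2] <- "b_i"

# Left join: every test row is kept, unmatched movies get b_i = NA
joined <- merge(test_toy, b_toy, by = "movieId", all.x = TRUE)
pred_toy <- mu_toy + joined$b_i

sum(is.na(pred_toy))                 # number of predictions that are NA
mean((joined$rating - pred_toy)^2)   # NA, because one input is NA
```

On the real data, sum(is.na(predicted_ratings)) would show how many test rows are affected.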

For most common summary functions such as sum, mean, and sd, simply adding na.rm = TRUE as an argument works.

In your case:

mean(x, na.rm = TRUE)
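
Applied to the RMSE function from the question, this could look like the sketch below. RMSE_narm is a hypothetical name and the vectors are made up. Note that na.rm = TRUE silently drops the rows without a prediction, so the RMSE is computed over fewer ratings. One alternative, shown at the end, is to fall back to a default prediction for those rows; another, assuming dplyr is loaded, is to restrict the test set beforehand with semi_join(train_set, by = "movieId").

```r
# Hypothetical variant of RMSE that ignores missing predictions
RMSE_narm <- function(true_ratings, predicted_ratings){
  sqrt(mean((true_ratings - predicted_ratings)^2, na.rm = TRUE))
}

# Made-up values: the third prediction is missing
true_toy <- c(5, 3, 4)
pred_toy <- c(4, 3, NA)

# Computed over the two non-missing pairs: sqrt((1 + 0) / 2)
RMSE_narm(true_toy, pred_toy)

# Alternative: fall back to a default prediction (here 3.5, an arbitrary
# stand-in for mu_hat) so that every test rating still counts.
# Squared errors are then 1, 0, 0.25.
pred_filled <- ifelse(is.na(pred_toy), 3.5, pred_toy)
RMSE_narm(true_toy, pred_filled)
```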

Upvotes: 1
