SJ Lin

Reputation: 19

How to vectorize a for loop in R for a large dataset

I'm relatively new to R and I have a question about data processing. The main issue is that the dataset is too big for a for loop, and I want to write a vectorized function that's faster, but I don't know how. The data is about movies and user ratings, and is formatted like this:

1:
5,3,2005-09-06
1,5,2005-05-13
3,4,2005-10-19

2:
2,4,2005-12-26
3,3,2004-05-03
5,3,2005-11-17

The 1: and 2: lines represent movies, while the other lines give a user ID, the user's rating, and the date of the rating for that movie (in that order from left to right, separated by commas). I want to format the data as an edge list, like this:

Movie | User
1:    | 5
1:    | 1
1:    | 3
2:    | 2
2:    | 3
2:    | 5

I wrote the code below to do this. For every row, it checks whether the row is a movie ID (containing ':') or user data. It then combines the movie ID and user ID as two columns and row-binds them to a new data frame. It only binds users who rated the movie 5 out of 5.

el <- data.frame(matrix(ncol = 2, nrow = 0))

for (i in 1:nrow(data))
{
  if (grepl(':', data[i,]))
  {
    mid <- data[i,]
  } else if (grepl(',', data[i,]))
  {
    if(grepl(',5,', data[i,]))
    {
      uid <- unlist(strsplit(data[i,], ','))[1]
      add <- c(mid, uid)
      el <- rbind(el, add)
    }
  }
}

However, I have about 100 million entries, and the for loop runs all night without finishing. Is there a way to speed this up? I read about vectorization, but I can't figure out how to vectorize this function. Any help?

Upvotes: 1

Views: 88

Answers (1)

David Robinson

Reputation: 78610

You can do this with a few regular expressions, using the stringr package, together with na.locf from the zoo package. (You'll have to install stringr and zoo first.)

First we'll set up your data, which it sounds like is in a one-column data frame:

data <- read.table(textConnection("1:
5,3,2005-09-06
1,5,2005-05-13
3,4,2005-10-19

2:
2,4,2005-12-26
3,3,2004-05-03
5,3,2005-11-17
"), stringsAsFactors = FALSE)
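For the full dataset it may be simpler to read the file straight into a character vector instead of a one-column data frame. The following is a minimal sketch using base R's readLines; the temporary file is only a stand-in for the real ratings file, whose path is not given in the question.

```r
# Write a small sample to a temporary file (a stand-in for the real file)
path <- tempfile(fileext = ".txt")
writeLines(c("1:", "5,3,2005-09-06", "1,5,2005-05-13", "",
             "2:", "2,4,2005-12-26"), path)

# Read every line as a string, then drop the blank separator lines
lines <- readLines(path)
lines <- lines[nzchar(lines)]
print(lines)
```

The resulting character vector can be used directly in place of a column pulled out of a data frame.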

Then follow these steps (explained in the comments):

# Pull out the column as a character vector for simplicity
lines <- data[[1]]

library(stringr)
# Figure out which lines represent movie IDs, and extract IDs
movie_ids <- str_match(lines, "(\\d+):")[, 2]

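# Carry the last observation forward (LOCF) so that every line gets
# the most recent non-NA movie ID; na.rm = FALSE keeps the vector
# the same length as lines even if it starts with NA
library(zoo)
movie_ids_filled <- na.locf(movie_ids, na.rm = FALSE)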

# Extract the user IDs
user_ids <- str_match(lines, "(\\d+),")[, 2]

# For each line that has a user ID, match it to the movie ID
result <- cbind(movie_ids_filled[!is.na(user_ids)],
                user_ids[!is.na(user_ids)])

This gives the result:

     [,1] [,2]
[1,] "1"  "5" 
[2,] "1"  "1" 
[3,] "1"  "3" 
[4,] "2"  "2" 
[5,] "2"  "3" 
[6,] "2"  "5" 
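The matrix above pairs every user with a movie, whereas the question asks to keep only users who gave a rating of 5. Below is a minimal base-R sketch of one way to add that filter. It is not the method used in this answer: it swaps na.locf for a cumsum() index and uses sub() instead of str_match. The sample vector and the variable names (is_movie, rating_lines, el and so on) are illustrative assumptions.

```r
# Sample input in the question's format, one string per line
lines <- c("1:", "5,3,2005-09-06", "1,5,2005-05-13", "3,4,2005-10-19", "",
           "2:", "2,4,2005-12-26", "3,3,2004-05-03", "5,3,2005-11-17")

# Drop blank separator lines
lines <- lines[nzchar(lines)]

# TRUE for movie header lines such as "1:"
is_movie <- grepl("^[0-9]+:$", lines)

# cumsum() counts the headers seen so far, so indexing the header IDs by
# that count gives every line its most recent movie ID.
# Assumption: the first line is a movie header.
movie_ids <- sub(":$", "", lines[is_movie])[cumsum(is_movie)]

# Rating lines look like "user,rating,date"
rating_lines <- lines[!is_movie]
user_ids <- sub(",.*$", "", rating_lines)
ratings <- sub("^[^,]*,([^,]*),.*$", "\\1", rating_lines)

# Keep only the 5-star ratings
keep <- ratings == "5"
el <- data.frame(Movie = movie_ids[!is_movie][keep],
                 User = user_ids[keep],
                 stringsAsFactors = FALSE)
print(el)
```

On the sample data only user 1's rating of movie 1 is a 5, so el has a single row. Every step works on whole vectors, so nothing grows inside a loop.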

The most important part of this code is the use of regular expressions, particularly the capturing groups in parentheses in "(\\d+):" and "(\\d+),". For more on using str_match with regular expressions, check out this guide.
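To make the role of the capturing group concrete, here is a small sketch of what str_match returns for one movie line and one rating line. It assumes stringr is installed; the two-element input vector x is only an illustration.

```r
library(stringr)  # assumes stringr is installed

x <- c("1:", "5,3,2005-09-06")

# Column 1 is the whole match, column 2 is the first capturing group;
# strings that do not match the pattern give NA in every column
m <- str_match(x, "(\\d+):")
print(m)

# Taking column 2 keeps the movie ID for header lines and NA elsewhere
movie_ids <- m[, 2]
print(movie_ids)
```

The NA values for non-matching lines are what make the later fill-forward step possible: each NA marks a rating line that still needs its movie ID.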

Upvotes: 4
