Ritesh Jung Thapa

Reputation: 1187

Regular expression taking too much time to compile in R

I read a file into ratingsFile using

ratingsFile <- readLines("~/ratings.list",encoding = "UTF-8")

The file's first few lines look like

  0000000125  1478759   9.2  The Shawshank Redemption (1994)
  0000000125  1014575   9.2  The Godfather (1972)
  0000000124  683611   9.0  The Godfather: Part II (1974)
  0000000124  1451861   8.9  The Dark Knight (2008)
  0000000124  1150611   8.9  Pulp Fiction (1994)
  0000000133  750978   8.9  Schindler's List (1993)

Using regular expressions, I extracted the tokens of each line (match) and the ratings (next_match):

  match <- gregexpr("[0-9A-Za-z;'$%&?@./]+",ratingsFile)
  match <- regmatches(ratingsFile,match)


  next_match <- gregexpr("[0-9][.][0-9]+",ratingsFile)
  next_match <- regmatches(ratingsFile,next_match)

A sample element of match looks like

  "0000000125" "1014575"    "9.2"        "The"        "Godfather"  "1972"  

To clean the data and get it into the form I need, I did

  movies_name <- character(0)
  rating <- character(0)
  for(i in 1:length(match)){
      match[[i]] <- match[[i]][-1:-3] # drop the first three columns (not needed)
      len <- length(match[[i]])
      match[[i]] <- match[[i]][-len] # drop the last column (the year), also not needed
      # append the movie name
      movies_name <- append(movies_name, paste(match[[i]], collapse = " "))
      # append the rating
      rating <- append(rating, next_match[[i]])
  }

Now this final block of code is taking too long to execute. I left it running for hours, but it still has not finished, as the file is 636497 lines long.

How can I reduce the run time in this case?

Upvotes: 1

Views: 123

Answers (3)

Cath

Reputation: 24074

If I understand correctly what you want to do (only get the movie titles), here is another option to get what you want:

unlist(lapply(strsplit(ratingsFile, "\\s{2,}"), # split each line whenever there are at least 2 spaces
                                 function(x){ # for each resulting vector
                                    x <- gsub(" \\(\\d{4}\\)$", "", tail(x, 1)) # keep only the needed part (movie title)
                                    x
                                 }))

# [1] "The Shawshank Redemption" "The Godfather"            "The Godfather: Part II"   "The Dark Knight"          "Pulp Fiction"            
# [6] "Schindler's List"

NB: you can put the resulting vector in a data.frame and/or keep the other information from the original lines.
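
A minimal sketch of that idea, assuming every line holds the four fields shown in the question, separated by at least two spaces, and that no title contains a run of two spaces. The column names (distribution, votes, rating, title_year, movie_name) are illustrative guesses, not taken from the original file:

```r
# Sample lines copied from the question; the real input would come from readLines().
ratingsFile <- c(
  "  0000000125  1478759   9.2  The Shawshank Redemption (1994)",
  "  0000000124  683611   9.0  The Godfather: Part II (1974)"
)

# Trim leading blanks, then split each line on runs of two or more spaces.
fields <- strsplit(trimws(ratingsFile), "\\s{2,}")

# Assumes each line yields exactly four fields; column names are hypothetical.
ratings <- data.frame(
  distribution = vapply(fields, `[`, character(1), 1),
  votes        = as.integer(vapply(fields, `[`, character(1), 2)),
  rating       = as.numeric(vapply(fields, `[`, character(1), 3)),
  title_year   = vapply(fields, `[`, character(1), 4),
  stringsAsFactors = FALSE
)

# Strip the trailing " (YYYY)" to keep only the title.
ratings$movie_name <- sub(" \\(\\d{4}\\)$", "", ratings$title_year)
```

Everything here operates on whole vectors, so no result vector is grown one element at a time as in the loop from the question.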

Upvotes: 2

shA.t

Reputation: 16968

If you want to extract several fields from each line, I think you can use this regex:

/^ *(\d*) *(\d*) *(\d+\.\d+)(.*)\((\d+)\)$/gm

with these capture groups (substitution references):

  • $1 => first column
  • $2 => second column
  • $3 => third column (maybe rating)
  • $4 => movie name
  • $5 => movie year
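
The regex above is written in PCRE demo notation, not R syntax. One way it might be applied in R is sketched below, assuming every line matches the pattern (a non-matching line would make the vapply() call fail). The names pattern, groups and movies are illustrative:

```r
# Sample lines copied from the question; the real input would come from readLines().
ratingsFile <- c(
  "  0000000125  1478759   9.2  The Shawshank Redemption (1994)",
  "  0000000133  750978   8.9  Schindler's List (1993)"
)

# Same pattern as above, with backslashes doubled for an R string literal.
# Each vector element is one line, so the /gm flags are not needed.
pattern <- "^ *(\\d*) *(\\d*) *(\\d+\\.\\d+)(.*)\\((\\d+)\\)$"

# regexec() returns the full match plus one entry per capture group.
m <- regmatches(ratingsFile, regexec(pattern, ratingsFile, perl = TRUE))

# Drop the full match (first element); rows are lines, columns are groups 1..5.
groups <- t(vapply(m, function(x) x[-1], character(5)))

movies <- data.frame(
  rating     = as.numeric(groups[, 3]),
  movie_name = trimws(groups[, 4]),  # group 4 keeps the surrounding spaces
  year       = as.integer(groups[, 5]),
  stringsAsFactors = FALSE
)
```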

[Regex Demo]

Upvotes: 1

lukeA

Reputation: 54247

Try this:

ratingsFile <- readLines(n = 6)
0000000125  1478759   9.2  The Shawshank Redemption (1994)
0000000125  1014575   9.2  The Godfather (1972)
0000000124  683611   9.0  The Godfather: Part II (1974)
0000000124  1451861   8.9  The Dark Knight (2008)
0000000124  1150611   8.9  Pulp Fiction (1994)
0000000133  750978   8.9  Schindler's List (1993)
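# capture the rating and the title; skip the two leading columns and the year
pattern <- "\\d{10}\\s+\\d+\\s+([0-9.]+)\\s+(.*?)\\s\\(\\d{4}\\)"
parts <- regmatches(ratingsFile, regexec(pattern, ratingsFile))
# drop the full match from each element, keeping only the two capture groups
setNames(as.data.frame(t(sapply(parts, "[", -1))), c("rating", "movie_name"))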
#   rating               movie_name
# 1    9.2 The Shawshank Redemption
# 2    9.2            The Godfather
# 3    9.0   The Godfather: Part II
# 4    8.9          The Dark Knight
# 5    8.9             Pulp Fiction
# 6    8.9         Schindler's List

Upvotes: 2
