Reputation: 1187
I read a file into ratingsFile using:
ratingsFile <- readLines("~/ratings.list",encoding = "UTF-8")
The first few lines of the file look like this:
0000000125 1478759 9.2 The Shawshank Redemption (1994)
0000000125 1014575 9.2 The Godfather (1972)
0000000124 683611 9.0 The Godfather: Part II (1974)
0000000124 1451861 8.9 The Dark Knight (2008)
0000000124 1150611 8.9 Pulp Fiction (1994)
0000000133 750978 8.9 Schindler's List (1993)
Using regular expressions, I extracted the individual tokens (match) and the ratings (next_match):
match <- gregexpr("[0-9A-Za-z;'$%&?@./]+",ratingsFile)
match <- regmatches(ratingsFile,match)
next_match <- gregexpr("[0-9][.][0-9]+",ratingsFile)
next_match <- regmatches(ratingsFile,next_match)
A sample element of match looks like this:
"0000000125" "1014575" "9.2" "The" "Godfather" "1972"
To clean the data and get it into the form I need, I did:
movies_name <- character(0)
rating <- character(0)
for (i in seq_along(match)) {
  match[[i]] <- match[[i]][-1:-3]  # drop the first three fields, which are not needed
  len <- length(match[[i]])
  match[[i]] <- match[[i]][-len]   # drop the last field (the year), also not needed
  # append the movie name
  movies_name <- append(movies_name, paste(match[[i]], collapse = " "))
  # append the rating
  rating <- append(rating, next_match[[i]])
}
This final block of code takes too long to execute. I have left it running for hours and it still has not finished, as the file is 636497 lines long.
How can I reduce the running time in this case?
Upvotes: 1
Views: 123
Reputation: 24074
If I understand correctly what you want to do (only get the movie titles), here is another option to get what you want:
unlist(lapply(strsplit(ratingsFile, "\\s{2,}"), # split each line whenever there are at least 2 spaces
              function(x) { # for each resulting vector
                x <- gsub(" \\(\\d{4}\\)$", "", tail(x, 1)) # keep only the needed part (movie title)
                x
              }))
# [1] "The Shawshank Redemption" "The Godfather" "The Godfather: Part II" "The Dark Knight" "Pulp Fiction"
# [6] "Schindler's List"
NB: you can put the resulting vector in a data.frame and/or keep the other information from the original lines.
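As a rough sketch of that last point, the split fields can be collected into a data.frame instead of keeping only the title. The sample lines and their multi-space column layout are an assumption about ratings.list, and the names fields, last3 and movies are made up for this illustration:

```r
# Sample lines; the multi-space column layout is an assumption about ratings.list
ratingsFile <- c(
  "      0000000125  1478759   9.2  The Shawshank Redemption (1994)",
  "      0000000125  1014575   9.2  The Godfather (1972)",
  "      0000000124  683611   9.0  The Godfather: Part II (1974)"
)

# Trim each line, then split it on runs of two or more whitespace characters
fields <- strsplit(trimws(ratingsFile), "\\s{2,}")

# Keep the last three fields of every line: votes, rating, title
last3 <- t(vapply(fields, function(x) tail(x, 3), character(3)))

movies <- data.frame(
  votes      = as.integer(last3[, 1]),
  rating     = as.numeric(last3[, 2]),
  movie_name = sub(" \\(\\d{4}\\)$", "", last3[, 3]),  # strip the trailing " (year)"
  stringsAsFactors = FALSE
)
```

Everything here is vectorised, so no vector is grown one element at a time as in the loop from the question. Lines that do not have at least three fields would make vapply fail, so a real file may need filtering first.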
Upvotes: 2
Reputation: 16968
If you want to extract specific fields from each line, I think you can use this regex:
/^ *(\d*) *(\d*) *(\d+\.\d+)(.*)\((\d+)\)$/gm
together with substitutions that refer to its capture groups.
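In R, one possible way to apply this is sub() with backreferences. This is only a sketch: the pattern is the one above with the backslashes doubled for an R string, the variable names are made up for the illustration, and group 3 is taken to be the rating, group 4 the title and group 5 the year.

```r
# The regex from above, written as an R string (backslashes doubled)
line_pattern <- "^ *(\\d*) *(\\d*) *(\\d+\\.\\d+)(.*)\\((\\d+)\\)$"

x <- c(
  "0000000125 1478759 9.2 The Shawshank Redemption (1994)",
  "0000000124 683611 9.0 The Godfather: Part II (1974)"
)

# Replace each whole line with a single capture group
rating     <- as.numeric(sub(line_pattern, "\\3", x, perl = TRUE))
movie_name <- trimws(sub(line_pattern, "\\4", x, perl = TRUE))  # group 4 keeps surrounding spaces
year       <- as.integer(sub(line_pattern, "\\5", x, perl = TRUE))
```

sub() returns a line unchanged when the pattern does not match it, so lines that do not end in a plain "(year)" would need to be filtered out (for example with grepl) before converting to numbers.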
Upvotes: 1
Reputation: 54247
Try this:
ratingsFile <- readLines(n = 6)
0000000125 1478759 9.2 The Shawshank Redemption (1994)
0000000125 1014575 9.2 The Godfather (1972)
0000000124 683611 9.0 The Godfather: Part II (1974)
0000000124 1451861 8.9 The Dark Knight (2008)
0000000124 1150611 8.9 Pulp Fiction (1994)
0000000133 750978 8.9 Schindler's List (1993)
setNames(
  as.data.frame(t(sapply(
    regmatches(ratingsFile, regexec("\\d{10}\\s+\\d+\\s+([0-9.]+)\\s+(.*?)\\s\\(\\d{4}\\)", ratingsFile)),
    "[", -1))),
  c("rating", "movie_name"))
#   rating               movie_name
# 1    9.2 The Shawshank Redemption
# 2    9.2            The Godfather
# 3    9.0   The Godfather: Part II
# 4    8.9          The Dark Knight
# 5    8.9             Pulp Fiction
# 6    8.9         Schindler's List
Upvotes: 2