Reputation: 10646
There are big files in a directory called C:/logs. I need to read each log, grep for a few regular expressions, and put the output into a data frame. This is what I have, and it is taking a very long time to run:
final <- data.frame()
path <- c("C:/logs")
logs <- dir(path, pattern = "log", full.names = TRUE, ignore.case = TRUE)
toMatch <- c("compile.cancelSwapCode", "compile.insertCode", "compile.getCode",
             "compile.getCodes", "compile.getCodeWithAnalysis")
for (i in logs) {
  print(i)
  xx <- readLines(i)
  xxx <- grep(paste(toMatch, collapse = "|"), xx, value = TRUE)
  df <- read.table(text = xxx)
  final <- rbind(final, df)
}
The actual file sizes are 150 MB. Is there a quick way to do this?
Example log file:
2016-11-02 00:00:01,506 INFO [[(JSK) mux request dispatch][/][tid=1234][compileController.Code]] - Received request for test request: [ticket=test101]
2016-11-02 00:00:01,514 INFO [[(JSK) mux request dispatch][/][tid=1234][compileController.Code]] - request: [ticket=test101] found in cache, returning from Cache
2016-11-02 00:00:01,515 DEBUG [[(JSK) mux request dispatch][/][tid=1234][compileController.Code]] - compileController.Code finished in 9ms
2016-11-02 00:00:01,578 INFO [[(JSK) mux request dispatch][/][tid=2345][compileController.Code]] - Received request for test request: [ticket=test101]
2016-11-02 00:00:01,582 INFO [[(JSK) mux request dispatch][/][tid=2345][compileController.Code]] - request: [ticket=test101] found in cache, returning from Cache
2016-11-02 00:00:01,582 DEBUG [[(JSK) mux request dispatch][/][tid=2345][compileController.Code]] - compileController.Code finished in 4ms
2016-11-02 00:00:08,606 INFO [[(JSK) mux request dispatch][/][tid=6789][compileController.Code]] - Received request for test request: [ticket=test102]
2016-11-02 00:00:08,606 INFO [[(JSK) mux request dispatch][/][tid=6789][compileController.Code]] - request: [ticket=test102] found in cache, returning from Cache
2016-11-02 00:00:08,606 DEBUG [[(JSK) mux request dispatch][/][tid=6789][compileController.Code]] - compileController.Code finished in 0ms
2016-11-02 00:00:09,320 INFO [[(JSK) mux request dispatch][/][tid=566][compileController.Code]] - Received request for test request: [ticket=test102]
2016-11-02 00:00:09,320 INFO [[(JSK) mux request dispatch][/][tid=566][compileController.Code]] - request: [ticket=test102] found in cache, returning from Cache
Upvotes: 1
Views: 2256
Reputation: 18612
The major bottleneck in your code seems to be the call to read.table. By first concatenating the vector xxx into a single, newline-delimited string, you can pass it to data.table's fread, which is much faster. Additionally, I've replaced the call to grep with stri_subset from the stringi package, and used data.table's rbindlist to combine everything rather than iteratively calling rbind. I'm not sure how many files you have in total, but in the code below I'm using nine 180 MB files created from your sample.
library(data.table)
library(stringi)
files <- list.files(
  "/tmp",
  pattern = "logfile",
  full.names = TRUE
)

re <- paste(
  c("compile.cancelSwapCode", "compile.insertCode",
    "compile.getCode", "compile.getCodes",
    "compile.getCodeWithAnalysis", "DEBUG"),
  collapse = "|"
)

system.time({
  final <- rbindlist(
    lapply(files, function(f) {
      lines <- readLines(f)
      # keep only the matching lines, then parse them all at once with fread
      fread(
        paste0(
          stri_subset(lines, regex = re),
          collapse = "\n"
        ),
        header = FALSE,
        sep = " ",
        colClasses = "character"
      )
    })
  )
})
# user system elapsed
# 136.664 0.628 137.378
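To apply this to the directory from your question, the file listing would presumably just mirror the path and pattern from your original code (a sketch; adjust as needed):

files <- list.files("C:/logs", pattern = "log",
                    full.names = TRUE, ignore.case = TRUE)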
It looks like you are on Windows, so if you have Rtools installed and are able to leverage Rcpp, you can shave off some more time by replacing readLines with a dead simple C++ equivalent:
#include <Rcpp.h>
#include <fstream>
// [[Rcpp::export]]
Rcpp::CharacterVector read_lines(const char* file) {
  // read every line of the file into a std::vector, then hand it back to R
  std::vector<std::string> res;
  res.reserve(10000);
  std::ifstream stream(file);
  std::string tmp;
  while (std::getline(stream, tmp)) {
    res.push_back(tmp);
  }
  return Rcpp::wrap(res);
}
Replacing readLines with read_lines in the above, I get
# user system elapsed
# 39.200 0.376 39.596
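For reference, here is a minimal sketch of how the C++ function can be compiled and used from R (assuming the snippet above is saved as read_lines.cpp; the filename is just an example):

library(Rcpp)
sourceCpp("read_lines.cpp")   # compiles the file and exposes read_lines() in the session
lines <- read_lines(files[1]) # drop-in replacement for readLines() in the lapply above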
If you aren't able to get up and running with Rtools and Rcpp, using read_lines from the readr package will still be faster than readLines. Here is a comparison of all three, reading in those nine 180 MB files:
system.time({ lapply(files, readLines) })
# user system elapsed
# 287.136 1.140 288.471
system.time({ lapply(files, readr::read_lines) })
# user system elapsed
# 91.568 0.604 91.895
system.time({ lapply(files, read_lines) })
# user system elapsed
# 24.204 0.652 24.862
Update
Regarding your comment below, IIUC you can do something like this:
final <- rbindlist(
  lapply(files, function(f) {
    lines <- readLines(f)
    matches <- stri_subset(lines, regex = re)
    # keep just the timestamp (to the second) of each matching line
    data.table(
      stri_match_first(
        matches,
        regex = "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}"
      )
    )
  })
)[, .(Count = .N), by = "V1"]
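The result should have one row per second-level timestamp (column V1) with the number of matching lines in that second (Count); if you want a clearer column name and chronological order, something like this should work:

setnames(final, "V1", "timestamp")  # rename the timestamp column
setorder(final, timestamp)          # sort the counts by time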
Upvotes: 3