Reputation: 1503
There appear to be similar questions to this in other languages but I can't find one in R.
I have a number of text files in the subdirectories of a directory; they all have the extension (.log) and they contain a mixture of text and data. I want to extract a couple of lines from these relatively large files.
For example, one file goes as follows ...
blahblahblah
NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS = 210
blahblahblah
----------------------------------------
CPU timing information for all processes
========================================
0: 8853.469 + 133.948 = 8987.417
1: 8850.817 + 126.587 = 8977.405
2: 8851.925 + 128.576 = 8980.501
3: 8847.992 + 125.871 = 8973.864
----------------------------------------
ddikick.x: exited gracefully.
blahblahblah
I want to harvest the number of basis functions (210 in this example) and the total CPU time.
The line "NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS =" occurs only once in each file; i.e., if I open the file in a text editor and search for this string, I get only this one line. The same holds for "CPU timing information for all processes" and "exited gracefully".
I appreciate that it appears that I haven't done a lot to help myself but I just don't know where to start. If someone could point me in the right direction, I hope to be able to fill in the rest.
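After the help given to me by @Ben (see below), here is the code that I ended up using:
library(stringr)  # provides str_extract() and str_extract_all()
filesearch <- function(x) {
  f <- readLines(x)
  # Line holding the basis-function count; the number sits at the end of the line
  cline <- grep("NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS", f, value = TRUE)
  val <- as.numeric(str_extract(cline, "[0-9]+$"))
  # Index of the CPU timing header; the four data rows start two lines below it
  coline <- grep("^ +CPU timing information", f)
  numstr <- sapply(str_extract_all(f[coline + 2:5], "[0-9.]+"), as.numeric)
  # Row 4 holds each process's total; sum the totals and divide by 60
  cline1 <- sum(numstr[4, ]) / 60
  output <- c(val, cline1)
  cat(output, "\n")   # print the two values
  invisible(output)   # also return them; cat() itself returns NULL
}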
I sourced this function and keyed in the file that I needed each time, then I transferred the two results to another file by hand. Not as elegant as I'd like but it saved me a lot of time doing it this way. Thanks again to @Ben.
Upvotes: 7
Views: 14993
Reputation: 226057
maybe
library(stringr)
f <- readLines("datafile.txt")
cline <- grep("NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS",f,
value=TRUE)
val <- as.numeric(str_extract(cline,"[0-9]+$"))
will work?
To get the other values, try
cline <- grep("^ +CPU timing information",f)
(numstr <- sapply(str_extract_all(f[cline+2:5],"[0-9.]+"),as.numeric))
##          [,1]     [,2]     [,3]     [,4]
## [1,]    0.000    1.000    2.000    3.000
## [2,] 8853.469 8850.817 8851.925 8847.992
## [3,]  133.948  126.587  128.576  125.871
## [4,] 8987.417 8977.405 8980.501 8973.864
The sapply has transposed the matrix of values, so the last row is the bit we want (it corresponds to the last column in the file). Extract it using numstr[4,] or numstr[nrow(numstr),] or tail(numstr,1).
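The fixed offsets 2:5 assume exactly four processes. If the number of processes varies between files, one option is to take every row between the header and the next dashed rule. The following is only a sketch under assumptions: the helper name cpu_totals is made up, the leading spaces in the sample lines are guessed from the "^ +" in the pattern above, and base R's sub() stands in for str_extract_all.

```r
# Sample lines adapted from the question; a real file would come from readLines().
f <- c(
  "blahblahblah",
  " ----------------------------------------",
  " CPU timing information for all processes",
  " ========================================",
  " 0: 8853.469 + 133.948 = 8987.417",
  " 1: 8850.817 + 126.587 = 8977.405",
  " 2: 8851.925 + 128.576 = 8980.501",
  " 3: 8847.992 + 125.871 = 8973.864",
  " ----------------------------------------",
  " ddikick.x: exited gracefully."
)

# Hypothetical helper: per-process totals (the number after "=" on each row).
cpu_totals <- function(f) {
  start <- grep("^ *CPU timing information", f)
  # The first dashed rule after the header marks the end of the table.
  rules <- grep("^ *-+ *$", f)
  end <- min(rules[rules > start])
  # Skip the "====" line right after the header; stop before the dashed rule.
  rows <- f[(start + 2):(end - 1)]
  # Keep only the last number on each row.
  as.numeric(sub(".*= *([0-9.]+) *$", "\\1", rows))
}

totals <- cpu_totals(f)
cpu_minutes <- sum(totals) / 60  # same quantity as sum(numstr[4,])/60
```

This gives the same totals as the last row of numstr for the sample above, without depending on how many processes were used.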
(edit: allow spaces before the "CPU timing" string) (edit: do it right!)
(To do this for all the log files, package it in a function and use list.files(pattern="\\.log$") in combination with sapply ...)
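A rough sketch of how that could look, not a definitive version. The names extract_log and collect_logs and the column names are made up for illustration; recursive = TRUE is added because the question says the files sit in subdirectories; base R's sub() is used in place of stringr; and the demo file is built from the sample in the question, with leading spaces assumed.

```r
# Hypothetical helper: basis-function count and CPU minutes for one log file.
extract_log <- function(path) {
  f <- readLines(path)
  bline <- grep("NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS", f, value = TRUE)
  nbasis <- as.numeric(sub(".*= *([0-9]+) *$", "\\1", bline))
  start <- grep("^ *CPU timing information", f)
  rows <- f[start + 2:5]  # assumes exactly four processes, as in the question
  totals <- as.numeric(sub(".*= *([0-9.]+) *$", "\\1", rows))
  c(nbasis = nbasis, cpu_minutes = sum(totals) / 60)
}

# Hypothetical helper: run extract_log on every .log file below a directory.
collect_logs <- function(dir = ".") {
  files <- list.files(dir, pattern = "\\.log$", recursive = TRUE,
                      full.names = TRUE)
  res <- lapply(files, extract_log)
  data.frame(file = files, do.call(rbind, res))
}

# Demo on a throwaway file in a subdirectory of the session's temp directory.
demo_root <- file.path(tempdir(), "logs")
dir.create(file.path(demo_root, "run1"), recursive = TRUE, showWarnings = FALSE)
writeLines(c(
  "blahblahblah",
  " NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS =  210",
  "blahblahblah",
  " ----------------------------------------",
  " CPU timing information for all processes",
  " ========================================",
  " 0: 8853.469 + 133.948 = 8987.417",
  " 1: 8850.817 + 126.587 = 8977.405",
  " 2: 8851.925 + 128.576 = 8980.501",
  " 3: 8847.992 + 125.871 = 8973.864",
  " ----------------------------------------",
  " ddikick.x: exited gracefully."
), file.path(demo_root, "run1", "example.log"))

results <- collect_logs(demo_root)
# One row per file; writing a CSV avoids copying the values by hand.
write.csv(results, file.path(tempdir(), "summary.csv"), row.names = FALSE)
```

For real data, collect_logs would be pointed at the top-level directory holding the logs instead of the demo directory.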
Upvotes: 7