Reputation: 1503
This is part of a file,
GIAO CHEMICAL SHIELDING TENSOR (PPM):
ISOTROPIC
X Y Z SHIELDING
( ANISOTROPY )
1 C X 192.9847 -0.3288 0.5647
Y 0.8908 133.5254 1.9987
Z -1.5286 1.9986 131.2590
152.5897
EIGENVALS: 192.9663 130.1130 134.6898
( 60.5649 )
2 O X 293.7037 -11.3068 19.4099
Y -27.8836 337.6867 -38.0711
Z 47.8680 -38.0711 380.8636
337.4180
EIGENVALS: 283.3105 413.4345 315.5091
( 114.0247 )
3 H X 32.4132 -2.6310 -3.6171
Y -0.9732 26.6966 2.2004
Z -1.6244 2.2423 28.7795
29.2964
EIGENVALS: 34.4129 28.1896 25.2868
( 7.6748 )
4 H X 32.4132 4.4443 0.5044
Y 1.8931 30.1789 0.2675
Z 0.0452 0.2257 25.2970
29.2964
EIGENVALS: 34.4129 28.1895 25.2867
( 7.6748 )
5 H X 31.3212 -2.3074 3.9610
Y -1.0235 26.9345 -1.8682
Z 1.7569 -1.8682 29.0533
29.1030
EIGENVALS: 33.8408 27.6219 25.8462
( 7.1067 )
6 H X 32.4086 -3.5167 6.0369
Y -2.5502 27.9731 -8.1180
Z 4.3777 -8.1180 37.1798
32.5205
EIGENVALS: 29.5456 44.7719 23.2441
( 18.3770 )
..... DONE WITH NMR SHIELDINGS .....
I am interested in the data between the lines, "GIAO CHEMICAL SHIELDING TENSOR (PPM)" and "..... DONE WITH NMR SHIELDINGS ....."
which I want to turn into a data frame,
1C 152.5897 60.5649
2O 337.4180 114.0247
3H 29.2964 7.6748
4H 29.2964 7.6748
5H 29.1030 7.1067
6H 32.5205 18.3770
I have other files of this nature but I don't know where to start. OK, I can use readLines() to get the file into R, but after that...?
Here's a link to the full file, https://drive.google.com/file/d/0B2RulP80ivJaR0NVV1ZLRlpxVDA/view?usp=sharing, rather than just my scraped excerpt of it. Some of the answers that work on the shortened, scraped version don't work on the full file.
Upvotes: 0
Views: 87
Reputation: 269694
Read in the file and keep only the part between the two marker lines. From that part, extract the lines that start with a (possibly empty) sequence of spaces and open parens followed by a digit. Then, in the second line of the "extract fields" step, paste together the first 3 characters and characters 58-68 of each such line and remove all spaces:
# extract portion of file between ix[1] and ix[2]
L <- readLines("myfile.dat")
ix <- grep("GIAO CHEMICAL SHIELDING TENSOR|DONE WITH NMR SHIELDING", L)
L0 <- L[seq(ix[1]+1, ix[2]-1)]
# extract fields
g <- grep("^[ (]*\\d", L0, value = TRUE)
res <- gsub(" ", "", paste(substr(g, 1, 3), substr(g, 58, 68)))
This gives:
> res
[1] "1C" "152.589" "60.564" "2O" "337.418" "114.024" "3H"
[8] "29.296" "7.674" "4H" "29.296" "7.674" "5H" "29.103"
[15] "7.106" "6H" "32.520" "18.377"
The above gives what was asked for, but you might also want to reshape that into a matrix,
m <- matrix(res, ncol = 3, byrow = TRUE)
or data.frame,
data.frame(V1 = m[, 1], V2 = as.numeric(m[,2]), V3 = as.numeric(m[,3]))
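The character positions 58-68 are tied to the fixed-width layout of the full file, so they will not line up on a whitespace-collapsed copy such as the excerpt in the question. As a rough sketch of the same idea without column positions, the lines can be classified by pattern instead. The sample `L0` below is just the first two atoms from the question pasted inline, and the assumption that the isotropic value sits alone on its own line comes from that excerpt, so the patterns may need adjusting for the real file.

```r
# First two atoms from the question, inlined so the sketch runs on its own
L0 <- c(
  "1 C X 192.9847 -0.3288 0.5647",
  "Y 0.8908 133.5254 1.9987",
  "Z -1.5286 1.9986 131.2590",
  "152.5897",
  "EIGENVALS: 192.9663 130.1130 134.6898",
  "( 60.5649 )",
  "2 O X 293.7037 -11.3068 19.4099",
  "Y -27.8836 337.6867 -38.0711",
  "Z 47.8680 -38.0711 380.8636",
  "337.4180",
  "EIGENVALS: 283.3105 413.4345 315.5091",
  "( 114.0247 )"
)

# atom lines: atom index, element symbol, then the X row of the tensor
atom <- grep("^\\s*\\d+\\s+[A-Za-z]+\\s+X\\s", L0, value = TRUE)
# isotropic lines: a single number and nothing else (assumed layout)
iso <- grep("^\\s*-?[0-9.]+\\s*$", L0, value = TRUE)
# anisotropy lines: a single number inside parentheses
aniso <- grep("^\\s*\\(\\s*-?[0-9.]+\\s*\\)\\s*$", L0, value = TRUE)

df <- data.frame(
  V1 = sub("^\\s*(\\d+)\\s+([A-Za-z]+).*$", "\\1\\2", atom),  # "1 C ..." -> "1C"
  V2 = as.numeric(trimws(iso)),
  V3 = as.numeric(gsub("[() ]", "", aniso)),
  stringsAsFactors = FALSE
)
df
```

Because each field is found by its shape rather than its position, extra leading or trailing spaces should not matter, but a line that happens to contain a lone number for some other reason would be picked up as an isotropic value.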
Upvotes: 3
Reputation: 887223
You could also do:
library(stringr)
lines <- readLines("myfile.txt") #read the .txt file.
Extract the numbers within the round brackets, or the digits followed by a space followed by letters at the beginning of each line. You can use a regex lookbehind to extract the contents within the brackets, i.e. (?<=\\() +[0-9.]+ where the lookbehind (?<=\\() asserts that an opening parenthesis immediately precedes the spaces, which are then followed by digits and dots.
# perl() is deprecated in newer stringr releases; the default regex engine handles the lookbehind
val1 <- na.omit(str_trim(str_extract(lines, "(?<=\\() +[0-9.]+|^\\d+ [A-Za-z]+")))
Create a logical index, which recycles to the length of val1 and extracts the 1st, 3rd, 5th, ... elements.
indx <- c(TRUE, FALSE)
Use either indx or !indx to extract the elements of val1 into two columns, V1 and V3. For V2, look for the lines that precede EIGENVALS and extract the numbers.
res <- data.frame(V1 = gsub(" ", "", val1[indx]),
                  V2 = as.numeric(lines[grep("EIGENVALS", lines) - 1]),
                  V3 = as.numeric(val1[!indx]),
                  stringsAsFactors = FALSE)
res
# V1 V2 V3
#1 1C 152.5897 60.5649
#2 2O 337.4180 114.0247
#3 3H 29.2964 7.6748
#4 4H 29.2964 7.6748
#5 5H 29.1030 7.1067
#6 6H 32.5205 18.3770
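Since the question mentions other files of the same kind, the EIGENVALS trick can be wrapped in a function and applied to each file. The following is only a sketch in base R: `parse_shielding` is a made-up helper name, and the line offsets (atom label 4 lines before EIGENVALS, isotropic value 1 line before, anisotropy 1 line after, once blank lines are dropped) are assumptions read off the excerpt in the question, so they should be checked against the full file.

```r
# parse_shielding() is a hypothetical helper, not part of any package
parse_shielding <- function(lines) {
  # keep only the block between the two marker lines
  ix <- grep("GIAO CHEMICAL SHIELDING TENSOR|DONE WITH NMR SHIELDINGS", lines)
  stopifnot(length(ix) == 2)
  block <- lines[seq(ix[1] + 1, ix[2] - 1)]
  # drop blank lines so the offsets below stay stable
  block <- block[nzchar(trimws(block))]

  e <- grep("EIGENVALS", block)  # one anchor line per atom
  atom <- block[e - 4]           # assumed offset: "<index> <element> X ..." row
  data.frame(
    V1 = sub("^\\s*(\\d+)\\s+([A-Za-z]+).*$", "\\1\\2", atom),
    V2 = as.numeric(block[e - 1]),                     # isotropic shielding
    V3 = as.numeric(gsub("[() ]", "", block[e + 1])),  # anisotropy
    stringsAsFactors = FALSE
  )
}

# Small inline sample (last atom from the question) to try it out
sample_lines <- c(
  "GIAO CHEMICAL SHIELDING TENSOR (PPM):",
  "ISOTROPIC",
  "X Y Z SHIELDING",
  "( ANISOTROPY )",
  "6 H X 32.4086 -3.5167 6.0369",
  "Y -2.5502 27.9731 -8.1180",
  "Z 4.3777 -8.1180 37.1798",
  "32.5205",
  "EIGENVALS: 29.5456 44.7719 23.2441",
  "( 18.3770 )",
  "..... DONE WITH NMR SHIELDINGS ....."
)
out <- parse_shielding(sample_lines)
out

# For several files (the file pattern here is an assumption):
# files <- list.files(pattern = "\\.dat$")
# all_res <- do.call(rbind, lapply(files, function(f) parse_shielding(readLines(f))))
```

The `stopifnot()` on the marker count makes the function fail loudly if a file has no shielding block or more than one, rather than silently returning misaligned rows.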
Upvotes: 1