Extracting Data from a file which is on a different line from key words

Question

This is part of a file,

         GIAO CHEMICAL SHIELDING TENSOR (PPM):
                                                            ISOTROPIC
                      X             Y             Z         SHIELDING
                                                        (  ANISOTROPY )

1 C         X    192.9847       -0.3288        0.5647
            Y      0.8908      133.5254        1.9987
            Z     -1.5286        1.9986      131.2590
                                                             152.5897
 EIGENVALS:      192.9663      130.1130      134.6898
                                                        (     60.5649 )

2 O         X    293.7037      -11.3068       19.4099
            Y    -27.8836      337.6867      -38.0711
            Z     47.8680      -38.0711      380.8636
                                                             337.4180
 EIGENVALS:      283.3105      413.4345      315.5091
                                                        (    114.0247 )

3 H         X     32.4132       -2.6310       -3.6171
            Y     -0.9732       26.6966        2.2004
            Z     -1.6244        2.2423       28.7795
                                                              29.2964
 EIGENVALS:       34.4129       28.1896       25.2868
                                                        (      7.6748 )

4 H         X     32.4132        4.4443        0.5044
            Y      1.8931       30.1789        0.2675
            Z      0.0452        0.2257       25.2970
                                                              29.2964
 EIGENVALS:       34.4129       28.1895       25.2867
                                                        (      7.6748 )

5 H         X     31.3212       -2.3074        3.9610
            Y     -1.0235       26.9345       -1.8682
            Z      1.7569       -1.8682       29.0533
                                                              29.1030
 EIGENVALS:       33.8408       27.6219       25.8462
                                                        (      7.1067 )

6 H         X     32.4086       -3.5167        6.0369
            Y     -2.5502       27.9731       -8.1180
            Z      4.3777       -8.1180       37.1798
                                                              32.5205
 EIGENVALS:       29.5456       44.7719       23.2441
                                                        (     18.3770 )
 ..... DONE WITH NMR SHIELDINGS .....

I am interested in the data between the lines, "GIAO CHEMICAL SHIELDING TENSOR (PPM)" and "..... DONE WITH NMR SHIELDINGS ....."

which I want to turn into a data frame,

1C 152.5897  60.5649 
2O 337.4180  114.0247
3H 29.2964  7.6748
4H 29.2964  7.6748
5H 29.1030  7.1067
6H 32.5205  18.3770

I have other files of this nature but I don't know where to start. OK, I can use readLines() to get the file into R, but after that...?

Here's a link to the full file, https://drive.google.com/file/d/0B2RulP80ivJaR0NVV1ZLRlpxVDA/view?usp=sharing, rather than just my scraping of it. Some of the answers that work on the shortened, scraped version don't work on the full file.

G. Grothendieck · Accepted Answer

Read in the file and keep only the part between the two marker lines. From that part, extract the lines that start with a (possibly empty) sequence of spaces and open parens followed by a digit. Then, in the second line of the field-extraction step, paste together the first 3 characters and characters 58-68 of each such line and remove all spaces:

# extract portion of file between ix[1] and ix[2]
L <- readLines("myfile.dat")
ix <- grep("GIAO CHEMICAL SHIELDING TENSOR|DONE WITH NMR SHIELDING", L)
L0 <- L[seq(ix[1]+1, ix[2]-1)]

# extract fields
g <- grep("^[ (]*\\d", L0, value = TRUE)
res <- gsub(" ", "", paste(substr(g, 1, 3), substr(g, 58, 68)))

This gives:

> res
[1] "1C"      "152.589" "60.564"  "2O"      "337.418" "114.024" "3H"     
[8] "29.296"  "7.674"   "4H"      "29.296"  "7.674"   "5H"      "29.103" 
[15] "7.106"   "6H"      "32.520"  "18.377"

The above gives what was asked for, but you might also want to reshape that into a matrix,

m <- matrix(res, ncol = 3, byrow = TRUE)

or data.frame,

data.frame(V1 = m[, 1], V2 = as.numeric(m[,2]), V3 = as.numeric(m[,3]))
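
The fixed column positions above depend on the exact spacing of the file; note that the values in the output shown carry three decimals while the file prints four. As a rough alternative sketch, the same fields could be picked out by pattern instead of by column. The helper name `parse_shieldings`, the column names and the three regular expressions are assumptions made for illustration, and the sketch assumes every atom has exactly one isotropic line (a lone number) and one anisotropy line (a lone number in parentheses):

```r
# Hypothetical helper (not from the original answer): parse the shielding
# block from a character vector of lines, matching by pattern, not by column.
parse_shieldings <- function(lines) {
  start <- grep("GIAO CHEMICAL SHIELDING TENSOR", lines, fixed = TRUE)
  end   <- grep("DONE WITH NMR SHIELDINGS", lines, fixed = TRUE)
  stopifnot(length(start) >= 1)
  start <- start[1]
  end   <- end[end > start][1]   # first terminator after the header
  stopifnot(!is.na(end))
  block <- lines[(start + 1):(end - 1)]

  num      <- "[-+]?[0-9]+\\.[0-9]+"
  # atom lines: atom index, element symbol, then the "X" row label
  atom_re  <- "^\\s*([0-9]+)\\s+([A-Za-z]+)\\s+X\\s.*$"
  # isotropic lines: a single number alone on the line
  iso_re   <- paste0("^\\s*(", num, ")\\s*$")
  # anisotropy lines: a single number inside parentheses
  aniso_re <- paste0("^\\s*\\(\\s*(", num, ")\\s*\\)\\s*$")

  atoms <- grep(atom_re,  block, value = TRUE)
  iso   <- grep(iso_re,   block, value = TRUE)
  aniso <- grep(aniso_re, block, value = TRUE)
  # the layout assumption: one isotropic and one anisotropy value per atom
  stopifnot(length(atoms) == length(iso), length(iso) == length(aniso))

  data.frame(
    atom       = sub(atom_re, "\\1\\2", atoms),
    isotropic  = as.numeric(sub(iso_re, "\\1", iso)),
    anisotropy = as.numeric(sub(aniso_re, "\\1", aniso)),
    stringsAsFactors = FALSE
  )
}

# Small demo built from the first two atoms of the excerpt in the question
demo <- c(
  "         GIAO CHEMICAL SHIELDING TENSOR (PPM):",
  "                                                            ISOTROPIC",
  "                      X             Y             Z         SHIELDING",
  "                                                        (  ANISOTROPY )",
  "",
  "1 C         X    192.9847       -0.3288        0.5647",
  "            Y      0.8908      133.5254        1.9987",
  "            Z     -1.5286        1.9986      131.2590",
  "                                                             152.5897",
  " EIGENVALS:      192.9663      130.1130      134.6898",
  "                                                        (     60.5649 )",
  "",
  "2 O         X    293.7037      -11.3068       19.4099",
  "            Y    -27.8836      337.6867      -38.0711",
  "            Z     47.8680      -38.0711      380.8636",
  "                                                             337.4180",
  " EIGENVALS:      283.3105      413.4345      315.5091",
  "                                                        (    114.0247 )",
  " ..... DONE WITH NMR SHIELDINGS ....."
)
df <- parse_shieldings(demo)
print(df)
```

For a real file this would be called as `parse_shieldings(readLines("myfile.dat"))`, and `lapply()` over a vector of file names would cover several files of the same kind. If the full file contains other lines in the block that consist of a lone number, the patterns would need tightening.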
