DarrenRhodes

Reputation: 1503

Extracting Data from a file which is on a different line from key words

This is part of a file,

         GIAO CHEMICAL SHIELDING TENSOR (PPM):
                                                            ISOTROPIC
                      X             Y             Z         SHIELDING
                                                        (  ANISOTROPY )

1 C         X    192.9847       -0.3288        0.5647
            Y      0.8908      133.5254        1.9987
            Z     -1.5286        1.9986      131.2590
                                                             152.5897
 EIGENVALS:      192.9663      130.1130      134.6898
                                                        (     60.5649 )

2 O         X    293.7037      -11.3068       19.4099
            Y    -27.8836      337.6867      -38.0711
            Z     47.8680      -38.0711      380.8636
                                                             337.4180
 EIGENVALS:      283.3105      413.4345      315.5091
                                                        (    114.0247 )

3 H         X     32.4132       -2.6310       -3.6171
            Y     -0.9732       26.6966        2.2004
            Z     -1.6244        2.2423       28.7795
                                                              29.2964
 EIGENVALS:       34.4129       28.1896       25.2868
                                                        (      7.6748 )

4 H         X     32.4132        4.4443        0.5044
            Y      1.8931       30.1789        0.2675
            Z      0.0452        0.2257       25.2970
                                                              29.2964
 EIGENVALS:       34.4129       28.1895       25.2867
                                                        (      7.6748 )

5 H         X     31.3212       -2.3074        3.9610
            Y     -1.0235       26.9345       -1.8682
            Z      1.7569       -1.8682       29.0533
                                                              29.1030
 EIGENVALS:       33.8408       27.6219       25.8462
                                                        (      7.1067 )

6 H         X     32.4086       -3.5167        6.0369
            Y     -2.5502       27.9731       -8.1180
            Z      4.3777       -8.1180       37.1798
                                                              32.5205
 EIGENVALS:       29.5456       44.7719       23.2441
                                                        (     18.3770 )
 ..... DONE WITH NMR SHIELDINGS .....

I am interested in the data between the lines, "GIAO CHEMICAL SHIELDING TENSOR (PPM)" and "..... DONE WITH NMR SHIELDINGS ....."

which I want to turn into a data frame,

1C 152.5897  60.5649 
2O 337.4180  114.0247
3H 29.2964  7.6748
4H 29.2964  7.6748
5H 29.1030  7.1067
6H 32.5205  18.3770

I have other files of this nature but I don't know where to start. OK, I can use readLines() to get the file into R, but after that...?

Here's a link to the full file, https://drive.google.com/file/d/0B2RulP80ivJaR0NVV1ZLRlpxVDA/view?usp=sharing, rather than just my scraped excerpt of it. Some of the answers that work on the shortened, scraped version don't work on the full file.

Upvotes: 0

Views: 87

Answers (2)

G. Grothendieck

Reputation: 269694

Read in the file and extract the lines that start with a (possibly empty) sequence of spaces and open parentheses followed by a digit. Then paste together the first 3 characters and characters 58-69 of each such line and remove all spaces:

# extract portion of file between ix[1] and ix[2]
L <- readLines("myfile.dat")
ix <- grep("GIAO CHEMICAL SHIELDING TENSOR|DONE WITH NMR SHIELDING", L)
L0 <- L[seq(ix[1]+1, ix[2]-1)]

# extract fields: atom label from columns 1-3, value from columns 58-69
# (the values are right-aligned and end in column 69)
g <- grep("^[ (]*\\d", L0, value = TRUE)
res <- gsub(" ", "", paste(substr(g, 1, 3), substr(g, 58, 69)))

This gives:

> res
 [1] "1C"       "152.5897" "60.5649"  "2O"       "337.4180" "114.0247"
 [7] "3H"       "29.2964"  "7.6748"   "4H"       "29.2964"  "7.6748"  
[13] "5H"       "29.1030"  "7.1067"   "6H"       "32.5205"  "18.3770" 

The above is a flat character vector holding the requested values, so you will probably want to reshape it into a matrix,

m <- matrix(res, ncol = 3, byrow = TRUE)

or data.frame,

data.frame(V1 = m[, 1], V2 = as.numeric(m[,2]), V3 = as.numeric(m[,3]))
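
Fixed character positions only work if every file uses exactly the same column layout. As a minimal sketch, and not part of the answer above, the same three fields can be picked out by pattern instead. The helper name `parse_shieldings` and the column names are hypothetical, the sample lines are copied from the question, and the sketch assumes each marker line occurs exactly once in the input.

```r
# Hypothetical helper (an assumption, not the answer's code):
# pattern-based extraction that does not rely on column positions.
parse_shieldings <- function(L) {
  # keep only the block between the two marker lines
  ix <- grep("GIAO CHEMICAL SHIELDING TENSOR|DONE WITH NMR SHIELDINGS", L)
  L0 <- L[seq(ix[1] + 1, ix[2] - 1)]

  # atom rows look like "1 C   X   192.9847 ..."
  atom <- grep("^\\s*\\d+\\s+[A-Za-z]+\\s+X\\s", L0, value = TRUE)
  # isotropic shielding rows hold a single number and nothing else
  iso <- grep("^\\s*-?\\d+\\.\\d+\\s*$", L0, value = TRUE)
  # anisotropy rows look like "(   60.5649 )"
  aniso <- grep("^\\s*\\(\\s*-?\\d+\\.\\d+\\s*\\)\\s*$", L0, value = TRUE)

  data.frame(
    atom = sub("^\\s*(\\d+)\\s+([A-Za-z]+).*$", "\\1\\2", atom),
    isotropic = as.numeric(iso),
    anisotropy = as.numeric(gsub("[()[:space:]]", "", aniso)),
    stringsAsFactors = FALSE
  )
}

# Shortened sample taken from the question
sample_lines <- c(
  "         GIAO CHEMICAL SHIELDING TENSOR (PPM):",
  "                                                        (  ANISOTROPY )",
  "1 C         X    192.9847       -0.3288        0.5647",
  "            Y      0.8908      133.5254        1.9987",
  "                                                             152.5897",
  " EIGENVALS:      192.9663      130.1130      134.6898",
  "                                                        (     60.5649 )",
  "2 O         X    293.7037      -11.3068       19.4099",
  "                                                             337.4180",
  "                                                        (    114.0247 )",
  " ..... DONE WITH NMR SHIELDINGS ....."
)
shieldings <- parse_shieldings(sample_lines)
```

If the three patterns find different numbers of rows, `data.frame()` stops with an error, which is a useful signal that the file contains lines the patterns did not anticipate.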

Upvotes: 3

akrun

Reputation: 887223

You could also do:

 library(stringr)

 lines <- readLines("myfile.txt") #read the .txt file.

Extract either the number inside the round brackets or the digits, space and letters at the beginning of a line. A regex lookbehind picks out the contents of the brackets: (?<=\\() +[0-9.]+ asserts that ( comes immediately before one or more spaces followed by digits and dots.

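 # current stringr uses ICU regular expressions, which support lookbehind,
 # so the pattern is passed directly instead of through the deprecated perl()
 val1 <- na.omit(str_trim(str_extract(lines, "(?<=\\() +[0-9.]+|^\\d+ [A-Za-z]+")))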
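
The same pattern can be checked in isolation with base R, which needs perl = TRUE for lookbehind. This is only a sketch for illustration: the three test strings are shortened from the question and the variable names are assumptions.

```r
# Three representative lines: an anisotropy row, an atom row, and a row
# that should not match at all
x <- c("(     60.5649 )",
       "1 C         X    192.9847",
       " EIGENVALS:      192.9663")

pattern <- "(?<=\\() +[0-9.]+|^\\d+ [A-Za-z]+"

# regexpr() returns -1 where there is no match
m <- regexpr(pattern, x, perl = TRUE)
found <- as.vector(m) > 0

# regmatches() returns only the matched pieces, so fill them in by position
hits <- rep(NA_character_, length(x))
hits[found] <- regmatches(x, m)
hits <- trimws(hits)
```

The opening bracket is required by the lookbehind but is not part of the match, so only the spaces and the number are returned, and the spaces are then trimmed.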

Create a logical index. When used for subsetting it is recycled to the length of val1, so it selects the 1st, 3rd, 5th, ... elements.

  indx <- c(TRUE, FALSE)
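
A small sketch of the recycling, using a made-up vector that alternates labels and values in the same way as val1 (the names `val`, `labels` and `values` are only for illustration):

```r
# Alternating label / value pairs, as in val1
val <- c("1 C", "60.5649", "2 O", "114.0247", "3 H", "7.6748")

indx <- c(TRUE, FALSE)

# The length-2 index is recycled along val when subsetting
labels <- val[indx]    # odd positions: 1, 3, 5
values <- val[!indx]   # even positions: 2, 4, 6
```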

Use indx and !indx to split val1 into the two columns V1 and V3. For V2, take the line that precedes each EIGENVALS line and convert it to a number.

  res <- data.frame(V1=gsub(" ", "", val1[indx]), V2=as.numeric(lines[grep("EIGENVALS",
             lines)-1]),  V3=as.numeric(val1[!indx]), stringsAsFactors=FALSE)

 res
  #  V1       V2       V3
  #1 1C 152.5897  60.5649
  #2 2O 337.4180 114.0247
  #3 3H  29.2964   7.6748
  #4 4H  29.2964   7.6748
  #5 5H  29.1030   7.1067
  #6 6H  32.5205  18.3770

Upvotes: 1
