SSK
SSK

Reputation: 59

How to extract numerical data from a file and create a dataframe in R

Below is a sample template of my file where I would like to extract the numerical data and create a dataframe, (file name : mylog.log)

Contents of the file

*****************************************************************
 ******  option summary
 ******
 runlvl  = 3         bypass  = 2         
  Opening plot unit= 15
 file=new_run.pa0

 ******  
 

  ********  dc transfer curves tnom=  25.000 temp=  25.000 *****
x
        


     volt    current    
                    v0     
  1.0000e-01   -2.570e-11  
  2.0000e-01   -2.769e-11  
  3.0000e-01   -2.795e-11  
  4.0000e-01   -2.803e-11  
  5.0000e-01   -2.814e-11  
  6.0000e-01   -2.823e-11  
  7.0000e-01   -2.833e-11  
  8.0000e-01   -2.842e-11  
  9.0000e-01   -2.861e-11  
  1.0000e+00   -2.871e-11  
  1.1000e+00   -2.880e-11  
  1.2000e+00   -2.889e-11  
  1.3000e+00   -2.899e-11  
  1.4000e+00   -2.908e-11  
  1.5000e+00   -2.916e-11  
  1.6000e+00   -2.925e-11  
  1.7000e+00   -2.934e-11  
  1.8000e+00   -2.942e-11  
  1.9000e+00   -2.950e-11  
  2.0000e+00   -2.958e-11  
  2.1000e+00   -2.966e-11  
  2.2000e+00   -2.973e-11  
  2.3000e+00   -2.980e-11  
  2.4000e+00   -2.986e-11  
  2.5000e+00   -3.022e-11  
  2.6000e+00   -3.030e-11  
  2.7000e+00   -3.039e-11  
  2.8000e+00   -3.046e-11  
  2.9000e+00   -3.053e-11  
  3.0000e+00   -3.059e-11  
  3.1000e+00   -3.066e-11  
  3.2000e+00   -3.110e-11  
  3.3000e+00   -3.119e-11  
  3.4000e+00   -3.125e-11  
  3.5000e+00   -3.132e-11  
  3.6000e+00   -3.140e-11  
  3.7000e+00   -3.145e-11  
  3.8000e+00   -3.152e-11  
  3.9000e+00   -3.161e-11  
  4.0000e+00   -3.167e-11  
  4.1000e+00   -3.175e-11  
  4.2000e+00   -3.175e-11  
  4.3000e+00   -3.182e-11  
  4.4000e+00   -3.189e-11  
  4.5000e+00   -3.197e-11  
  4.6000e+00   -3.199e-11  
  4.7000e+00   -3.202e-11  
  4.8000e+00   -3.212e-11  
  4.9000e+00   -3.219e-11  
  5.0000e+00   -3.278e-11  
  5.1000e+00   -3.281e-11  
  5.2000e+00   -3.297e-11  
  5.3000e+00   -3.295e-11  
  5.4000e+00   -3.301e-11  
  5.5000e+00   -3.299e-11  
  5.6000e+00   -3.310e-11  
  5.7000e+00   -3.381e-11  
  5.8000e+00   -3.384e-11  
  5.9000e+00   -3.396e-11  
  6.0000e+00   -3.398e-11  
  6.1000e+00   -3.405e-11  
  6.2000e+00   -3.411e-11  
  6.3000e+00   -3.413e-11  
  6.4000e+00   -3.424e-11  
  6.5000e+00   -3.420e-11  
  6.6000e+00   -3.432e-11  
  6.7000e+00   -3.438e-11  
  6.8000e+00   -3.444e-11  
  6.9000e+00   -3.450e-11  
  7.0000e+00   -3.444e-11  
  7.1000e+00   -3.455e-11  
  3.9300e+01   -8.790e+00  
  3.9400e+01   -8.824e+00  
  3.9500e+01   -8.859e+00  
  3.9600e+01   -8.894e+00  
  3.9700e+01   -8.928e+00  
  3.9800e+01   -8.963e+00  
  3.9900e+01   -8.998e+00  
  4.0000e+01   -9.033e+00  
y

          ***** job concluded
 ****** HSPICE -- H-2013.03-SP2 32-BIT (Aug 26 2013) RHEL32 ******              
 ******  
 

  ******  job statistics summary tnom=  25.000 temp=  25.000 *****
  
  
 ******  Machine Information  ******

I would like to extract the values that are the below the line v0 and create a two column dataframe.

Volt             Current
-100.00000m      406.5220f
"........................"

for all the rows till the numerical data is present. The fact of the matter is, the number of lines before the line volt and current is dynamic and the number of lines after the numerical data gets over is also dynamic. The numerical data rows are also dynamic. One thing that can be considered is numerical data will always start after the below two lines of the file,

volt      current    
                        v0 

Below is my code which I have tried by specifying line numbers of the data,

DATA <- readLines(myfile)
  
  DataStartPos <- 314
  DataEndPos <- 1062
  
  
  #Seperate numeric data and META data and bind to data frame
  
    tmp <- as.data.frame(DATA[DataStartPos:DataEndPos])
    tmp <- separate(tmp, col = 1, c("S.No", "Volts", "Amps"), sep = "\\s+")

I'm now able to create a dataframe but only by specifying line numbers of the file statically. Is there any way to identify the numerical data with the explained case.

Upvotes: 0

Views: 162

Answers (1)

Justin Landis
Justin Landis

Reputation: 2071

fread from the data.table package will work nicely. It will give you a bunch of warning messages if you just use the default parameters. You could reduce the amount of warning messages if you took some care into specifying how many lines to skip.

#with a file
df <- fread(file ="path/to/target_file")
df
             V1        v0
# 1: -100.00000m 406.5220f
# 2: -200.00000m 806.6048f
# 3: -300.00000m   1.2066p
# 4: -400.00000m   1.6067p
# 5: -500.00000m   2.0067p
# 6: -600.00000m   2.4066p
# 7: -700.00000m   2.8066p
# 8: -800.00000m   3.2067p
# 9: -900.00000m   3.6067p
#10:    -1.00000   4.0067p
#11:    -1.10000   4.4067p
#12:    -1.20000   4.8068p
#13:    -1.30000   5.2069p
#14:    -1.40000   5.6068p
#15:    -1.50000   6.0070p
#16:    -1.60000   6.4069p
#17:    -1.70000   6.8070p
#18:    -1.80000   7.2069p
#19:    -1.90000   7.6070p
#20:    -2.00000   8.0069p
#21:    -2.10000   8.4071p
#22:    -2.20000   8.8070p
#23:    -2.30000   9.2071p
#24:    -2.40000 9.6070p**
#             V1        v0

EDIT 1 - skipping lines

The data.table package is very dynamic and flexible. Assuming there is something consistent about your input files - fread can potentially accomidate. To answer your question if you look at the documentation of fread you will find the skip argument

skip    
  If 0 (default) start on the first line and from there finds the first row
  with a consistent number of columns. This automatically avoids irregular 
  header information before the column names row. skip>0 means ignore the 
  first skip rows manually. skip="string" searches for "string" in the file 
  (e.g. a substring of the column names row) and starts on that line 
  (inspired by read.xls in package gdata).

So you could do...

df <- fread(file ="path/to/target_file", skip = "volt", col.names = c("volt","current"))

and it will work the same.

EDIT 2 - lines between 2 patterns

Depending on your OS (I do not know if this will work on Windows computers, but it should work on Mac and Linux like systems) you can use the cmd argument to parse the file before reading it into fread.

file_path <- "path/to/target_file"
fread(cmd = paste("sed -n '/^x/,/^y/p'", file_path), col.names = c("volt","current")

The sed command is diverse but this structure will print lines between /1/,/2/ - specifically lines between the patterns x and y. The ^ means that the following character should be the first character on that line to work. If you knew that x and y will be the only characters on the line you can make this more specific by making your command "sed -n '/^x$/,/^y$/p'". Again, I do not know if this will work on a windows but it works on my end.

Upvotes: 1

Related Questions