Reputation: 59
Below is a sample template of my file where I would like to extract the numerical data and create a dataframe, (file name : mylog.log)
Contents of the file
*****************************************************************
****** option summary
******
runlvl = 3 bypass = 2
Opening plot unit= 15
file=new_run.pa0
******
******** dc transfer curves tnom= 25.000 temp= 25.000 *****
x
volt current
v0
1.0000e-01 -2.570e-11
2.0000e-01 -2.769e-11
3.0000e-01 -2.795e-11
4.0000e-01 -2.803e-11
5.0000e-01 -2.814e-11
6.0000e-01 -2.823e-11
7.0000e-01 -2.833e-11
8.0000e-01 -2.842e-11
9.0000e-01 -2.861e-11
1.0000e+00 -2.871e-11
1.1000e+00 -2.880e-11
1.2000e+00 -2.889e-11
1.3000e+00 -2.899e-11
1.4000e+00 -2.908e-11
1.5000e+00 -2.916e-11
1.6000e+00 -2.925e-11
1.7000e+00 -2.934e-11
1.8000e+00 -2.942e-11
1.9000e+00 -2.950e-11
2.0000e+00 -2.958e-11
2.1000e+00 -2.966e-11
2.2000e+00 -2.973e-11
2.3000e+00 -2.980e-11
2.4000e+00 -2.986e-11
2.5000e+00 -3.022e-11
2.6000e+00 -3.030e-11
2.7000e+00 -3.039e-11
2.8000e+00 -3.046e-11
2.9000e+00 -3.053e-11
3.0000e+00 -3.059e-11
3.1000e+00 -3.066e-11
3.2000e+00 -3.110e-11
3.3000e+00 -3.119e-11
3.4000e+00 -3.125e-11
3.5000e+00 -3.132e-11
3.6000e+00 -3.140e-11
3.7000e+00 -3.145e-11
3.8000e+00 -3.152e-11
3.9000e+00 -3.161e-11
4.0000e+00 -3.167e-11
4.1000e+00 -3.175e-11
4.2000e+00 -3.175e-11
4.3000e+00 -3.182e-11
4.4000e+00 -3.189e-11
4.5000e+00 -3.197e-11
4.6000e+00 -3.199e-11
4.7000e+00 -3.202e-11
4.8000e+00 -3.212e-11
4.9000e+00 -3.219e-11
5.0000e+00 -3.278e-11
5.1000e+00 -3.281e-11
5.2000e+00 -3.297e-11
5.3000e+00 -3.295e-11
5.4000e+00 -3.301e-11
5.5000e+00 -3.299e-11
5.6000e+00 -3.310e-11
5.7000e+00 -3.381e-11
5.8000e+00 -3.384e-11
5.9000e+00 -3.396e-11
6.0000e+00 -3.398e-11
6.1000e+00 -3.405e-11
6.2000e+00 -3.411e-11
6.3000e+00 -3.413e-11
6.4000e+00 -3.424e-11
6.5000e+00 -3.420e-11
6.6000e+00 -3.432e-11
6.7000e+00 -3.438e-11
6.8000e+00 -3.444e-11
6.9000e+00 -3.450e-11
7.0000e+00 -3.444e-11
7.1000e+00 -3.455e-11
3.9300e+01 -8.790e+00
3.9400e+01 -8.824e+00
3.9500e+01 -8.859e+00
3.9600e+01 -8.894e+00
3.9700e+01 -8.928e+00
3.9800e+01 -8.963e+00
3.9900e+01 -8.998e+00
4.0000e+01 -9.033e+00
y
***** job concluded
****** HSPICE -- H-2013.03-SP2 32-BIT (Aug 26 2013) RHEL32 ******
******
****** job statistics summary tnom= 25.000 temp= 25.000 *****
****** Machine Information ******
I would like to extract the values that are the below the line v0 and create a two column dataframe.
Volt Current
-100.00000m 406.5220f
"........................"
for all the rows till the numerical data is present. The fact of the matter is, the number of lines before the line volt and current is dynamic and the number of lines after the numerical data gets over is also dynamic. The numerical data rows are also dynamic. One thing that can be considered is numerical data will always start after the below two lines of the file,
volt current
v0
Below is my code which I have tried by specifying line numbers of the data,
DATA <- readLines(myfile)
DataStartPos <- 314
DataEndPos <- 1062
#Seperate numeric data and META data and bind to data frame
tmp <- as.data.frame(DATA[DataStartPos:DataEndPos])
tmp <- separate(tmp, col = 1, c("S.No", "Volts", "Amps"), sep = "\\s+")
I'm now able to create a dataframe but only by specifying line numbers of the file statically. Is there any way to identify the numerical data with the explained case.
Upvotes: 0
Views: 162
Reputation: 2071
fread
from the data.table
package will work nicely. It will give you a bunch of warning messages if you just use the default parameters. You could reduce the amount of warning messages if you took some care into specifying how many lines to skip.
#with a file
df <- fread(file ="path/to/target_file")
df
V1 v0
# 1: -100.00000m 406.5220f
# 2: -200.00000m 806.6048f
# 3: -300.00000m 1.2066p
# 4: -400.00000m 1.6067p
# 5: -500.00000m 2.0067p
# 6: -600.00000m 2.4066p
# 7: -700.00000m 2.8066p
# 8: -800.00000m 3.2067p
# 9: -900.00000m 3.6067p
#10: -1.00000 4.0067p
#11: -1.10000 4.4067p
#12: -1.20000 4.8068p
#13: -1.30000 5.2069p
#14: -1.40000 5.6068p
#15: -1.50000 6.0070p
#16: -1.60000 6.4069p
#17: -1.70000 6.8070p
#18: -1.80000 7.2069p
#19: -1.90000 7.6070p
#20: -2.00000 8.0069p
#21: -2.10000 8.4071p
#22: -2.20000 8.8070p
#23: -2.30000 9.2071p
#24: -2.40000 9.6070p**
# V1 v0
The data.table
package is very dynamic and flexible. Assuming there is something consistent about your input files - fread
can potentially accomidate. To answer your question if you look at the documentation of fread
you will find the skip
argument
skip
If 0 (default) start on the first line and from there finds the first row
with a consistent number of columns. This automatically avoids irregular
header information before the column names row. skip>0 means ignore the
first skip rows manually. skip="string" searches for "string" in the file
(e.g. a substring of the column names row) and starts on that line
(inspired by read.xls in package gdata).
So you could do...
df <- fread(file ="path/to/target_file", skip = "volt", col.names = c("volt","current"))
and it will work the same.
Depending on your OS (I do not know if this will work on Windows computers, but it should work on Mac and Linux like systems) you can use the cmd
argument to parse the file before reading it into fread
.
file_path <- "path/to/target_file"
fread(cmd = paste("sed -n '/^x/,/^y/p'", file_path), col.names = c("volt","current")
The sed
command is diverse but this structure will print lines between /1/,/2/ - specifically lines between the patterns x
and y
. The ^
means that the following character should be the first character on that line to work.
If you knew that x
and y
will be the only characters on the line you can make this more specific by making your command "sed -n '/^x$/,/^y$/p'"
. Again, I do not know if this will work on a windows but it works on my end.
Upvotes: 1