Reputation: 852
I have a text file formatted as such:
# title "Secondary Structure"
# xaxis label "Time (ns)"
# yaxis label "Number of Residues"
#TYPE xy
# subtitle "Structure = A-Helix + B-Sheet + B-Bridge + Turn"
# view 0.15, 0.15, 0.75, 0.85
# legend on
# legend box on
# legend loctype view
# legend 0.78, 0.8
# legend length 2
# s0 legend "Structure"
# s1 legend "Coil"
# s2 legend "B-Sheet"
# s3 legend "B-Bridge"
# s4 legend "Bend"
# s5 legend "Turn"
# s6 legend "A-Helix"
# s7 legend "5-Helix"
# s8 legend "3-Helix"
# s9 legend "Chain_Separator"
0 637 180 201 7 94 129 300 0 47 1
1 617 189 191 11 99 121 294 5 48 1
2 625 183 198 7 97 130 290 0 53 1
3 625 180 195 5 102 125 300 0 51 1
4 622 185 196 5 99 117 304 0 52 1
5 615 192 190 5 106 121 299 0 45 1
6 629 187 196 7 102 122 304 0 40 1
I'm trying to match the lines starting with "s+number" (s0, s1, s2, ... s9) and save the values between the double quotes in a list, so I can then use this list for naming the columns.
list <- c("Structure", "Coil","B-Sheet", ..., "Chain_Separator")
names(data) <- list
The problem is that I can't match the single words but only the entire lines.
grep('s\\d\\s[a-z]{6}\\s\"([A-z-9]+)\"',readLines("file.xvg"),perl=T,value=T)
[1] "# s0 legend \"Structure\"" "# s1 legend \"Coil\""
[3] "# s2 legend \"B-Sheet\"" "# s3 legend \"B-Bridge\""
[5] "# s4 legend \"Bend\"" "# s5 legend \"Turn\""
[7] "# s6 legend \"A-Helix\"" "# s9 legend \"Chain_Separator\""
I tried several regexes, like '# s[0-9] [a-z]+ "([A-z-9]+)"', all of which work in Perl, but in R I always match the entire line and not the word.
Isn't the () used to capture the value? What am I doing wrong?
Upvotes: 2
Views: 228
Reputation: 99331
You can use a system command in fread(). For example, on a file named "file.txt" you can do
library(data.table)
fread("grep '^# s[0-9]\\+' file.txt", header = FALSE, select = 4)[[1]]
# [1] "Structure" "Coil" "B-Sheet"
# [4] "B-Bridge" "Bend" "Turn"
# [7] "A-Helix" "5-Helix" "3-Helix"
# [10] "Chain_Separator"
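Note: This was written against data.table dev version 1.9.5. Later releases expect the shell command to be passed explicitly, as fread(cmd = "...").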
Basically, the area you're looking for in the text has four columns. ^# s[0-9]\\+ looks for lines that begin with #, then a space, then s, then one or more digits. select = 4 takes the last column, and [[1]] drops it down from a single-column data table into a character vector.
Thanks to @BrodieG for help with the regex.
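Where a shell grep is not available (on Windows, for instance), the same three steps can be sketched in base R: filter the legend lines, split them into four columns, keep the fourth. This is only a sketch under assumptions: the lines vector is a shortened stand-in for readLines("file.txt"), and legend_names is a name chosen here for illustration.

```r
# Shortened stand-in for readLines("file.txt") (assumption: same layout as the question's file)
lines <- c(
  '# legend length 2',
  '# s0 legend "Structure"',
  '# s1 legend "Coil"',
  '# s7 legend "5-Helix"',
  '0 637 180 201 7 94 129 300 0 47 1'
)

# Step 1: keep only lines starting with "# s" followed by one or more digits
hits <- grep("^# s[0-9]+", lines, value = TRUE)

# Step 2: split into whitespace-separated columns.
# comment.char = "" is needed, otherwise read.table skips lines starting with "#".
cols <- read.table(text = hits, comment.char = "", stringsAsFactors = FALSE)

# Step 3: the fourth column holds the legend text (quotes are stripped by read.table)
legend_names <- cols[[4]]
print(legend_names)
```

The regex here drops the backslash before +, since R's default (extended) syntax treats + as a quantifier directly.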
Upvotes: 2
Reputation: 887128
If you use Linux, awk commands can be combined with read.table using pipe:
read.table(pipe("awk 'BEGIN {FS=\" \"}/# s[0-9]/ { print$4 }' fra.txt"),
stringsAsFactors=FALSE)$V1
# [1] "Structure" "Coil" "B-Sheet" "B-Bridge"
# [5] "Bend" "Turn" "A-Helix" "5-Helix"
# [9] "3-Helix" "Chain_Separator"
The above command also works with fread
fread("awk 'BEGIN {FS=\" \"}/# s[0-9]/ { print$4 }' fra.txt",
header=FALSE)$V1
Upvotes: 1
Reputation: 31171
You can do this:
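fileName = "file.xvg"  # path taken from the question; adjust as needed
conn = file(fileName, open = "r")
lines = readLines(conn)
close(conn)
# keep only the "# sN legend ..." lines
lst = Filter(function(u) grepl('^# s[0-9]+', u), lines)
# replace each kept line with the text between its double quotes
result = gsub('.*\"(.*)\".*', '\\1', lst)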
#> result
#[1] "Structure" "Coil" "B-Sheet" "B-Bridge" "Bend" "Turn" "A-Helix" "5-Helix"
#[9] "3-Helix" "Chain_Separator"
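As to why the original attempt returned whole lines: grep(..., value = TRUE) returns the matching elements of its input, never the capture groups. A minimal sketch of pulling out the group itself with regexec() and regmatches(), and then using it for the column names, might look like this. The lines vector is a trimmed toy excerpt of the file, and prepending "Time" for the first column is an assumption based on the x-axis label.

```r
# Trimmed toy excerpt of the file (assumption: real input comes from readLines("file.xvg"))
lines <- c(
  '# xaxis label "Time (ns)"',
  '# s0 legend "Structure"',
  '# s1 legend "Coil"',
  '# s2 legend "B-Sheet"',
  '0 637 180 201',
  '1 617 189 191'
)

# regexec() records the position of the full match and of each capture group;
# regmatches() turns that into c(full_match, group1) per line, or character(0) if no match
m <- regmatches(lines, regexec('^# s[0-9]+ legend "([^"]*)"', lines))
m <- m[lengths(m) > 0]                                   # drop non-matching lines
legend_names <- vapply(m, function(x) x[2], character(1)) # element 2 is the capture group

# Read the numeric rows (everything that is not a "#" header line) and name the columns
data <- read.table(text = grep("^#", lines, invert = TRUE, value = TRUE))
names(data) <- c("Time", legend_names)
print(data)
```

Anchoring on "legend" and using [^"]* avoids the [A-z-9] character class from the question, which does not match the digits in names such as "5-Helix".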
Upvotes: 2