Reputation: 147
I am looking to parse a text file in R and load it as a data.frame. I have a long text file with fixed-width data, separated into sections (ID) and subsections (SUB). The length of each section is variable. I'd like to create two data frames, one for the ID sections and one for the SUB sections. The example data is as follows:
Header 1
METRIC 0.30 10.00
ID K0107050 Aa
0.06 15.24 14.40 14.40 7.13 0.13 0.19 1
0.17 14.35 13.57 13.57 6.40 0.12 0.18 1
SUB
1.000 1.000 0.093 0.11 0.11 301
1.000 1.000 0.093 0.11 0.11 61
ID K0129050 Aa
0.06 26.35 24.90 24.90 10.88 0.62 0.88 1
0.15 25.35 23.96 23.96 10.93 0.55 0.74 1
SUB
3.000 3.000 0.506 0.53 0.53 102
4.000 4.000 0.514 0.55 0.55 118
The data frames I would like are:
DF1
Header 1 K0107050 Aa 0.06 15.24 14.40 14.40 7.13 0.13 0.19 1
Header 1 K0107050 Aa 0.17 14.35 13.57 13.57 6.40 0.12 0.18 1
Header 1 K0129050 Aa 0.06 26.35 24.90 24.90 10.88 0.62 0.88 1
Header 1 K0129050 Aa 0.15 25.35 23.96 23.96 10.93 0.55 0.74 1
DF2
Header 1 K0107050 Aa 1.000 1.000 0.093 0.11 0.11 301
Header 1 K0107050 Aa 1.000 1.000 0.093 0.11 0.11 61
Header 1 K0129050 Aa 3.000 3.000 0.506 0.53 0.53 102
Header 1 K0129050 Aa 4.000 4.000 0.514 0.55 0.55 118
I've gotten as far as reading the file with readLines(), but I get stuck after that because of the different sections in the text file. Thank you.
Upvotes: 3
Views: 1294
Reputation: 56169
Here is a start (sorry, time for bed...):
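x <- readLines("myFile.txt")
library(dplyr)
# Start a new group at every line containing a section keyword. The match is
# case-sensitive, so "Metric" does not match the "METRIC" line, which
# therefore stays in the "Header" group.
bind_rows(
  lapply(split(x, cumsum(grepl("Header|Metric|ID|SUB", x))), function(i){
    i1 <- i[ i != "" ]  # drop empty lines
    # split the data lines on spaces; non-numeric tokens become NA
    # (with a coercion warning) and are removed by na.omit() below
    nums <- unlist(strsplit(tail(i1, -1), " "))
    res <- cbind.data.frame(Grp = i1[1],  # first line labels the group
                            matrix(na.omit(as.numeric(nums)),
                                   nrow = length(i1) - 1, byrow = TRUE),
                            stringsAsFactors = FALSE)
    res
  })
)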
# Grp 1 2 3 4 5 6 7 8
# 1 Header 1 0.30 10.00 NA NA NA NA NA NA
# 2 ID K0107050 Aa 0.06 15.24 14.400 14.40 7.13 0.13 0.19 1
# 3 ID K0107050 Aa 0.17 14.35 13.570 13.57 6.40 0.12 0.18 1
# 4 SUB 1.00 1.00 0.093 0.11 0.11 301.00 NA NA
# 5 SUB 1.00 1.00 0.093 0.11 0.11 61.00 NA NA
# 6 ID K0129050 Aa 0.06 26.35 24.900 24.90 10.88 0.62 0.88 1
# 7 ID K0129050 Aa 0.15 25.35 23.960 23.96 10.93 0.55 0.74 1
# 8 SUB 3.00 3.00 0.506 0.53 0.53 102.00 NA NA
# 9 SUB 4.00 4.00 0.514 0.55 0.55 118.00 NA NA
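To get from this combined table to the two data frames asked for, one possible continuation is sketched below in base R. It is only a sketch under some assumptions: the sample lines are inlined so it runs on its own (swap in readLines("myFile.txt")), the first non-empty line is taken as the header, every data line is assumed to start with a digit, a minus sign or a dot, and the names header, id_label, section and make_df are made up for illustration. The ID text ("K0107050 Aa") is kept as a single column rather than split in two.

```r
# Sample input inlined for the sketch; in practice use readLines("myFile.txt")
x <- c(
  "Header 1",
  "METRIC 0.30 10.00",
  "ID K0107050 Aa",
  "0.06 15.24 14.40 14.40 7.13 0.13 0.19 1",
  "0.17 14.35 13.57 13.57 6.40 0.12 0.18 1",
  "SUB",
  "1.000 1.000 0.093 0.11 0.11 301",
  "1.000 1.000 0.093 0.11 0.11 61",
  "ID K0129050 Aa",
  "0.06 26.35 24.90 24.90 10.88 0.62 0.88 1",
  "0.15 25.35 23.96 23.96 10.93 0.55 0.74 1",
  "SUB",
  "3.000 3.000 0.506 0.53 0.53 102",
  "4.000 4.000 0.514 0.55 0.55 118"
)

# Trim whitespace and drop empty lines
x <- trimws(x)
x <- x[nzchar(x)]

header  <- x[1]                  # assumption: first line is the header
is_id   <- grepl("^ID\\s", x)    # "ID <label>" lines
is_sub  <- grepl("^SUB$", x)     # "SUB" marker lines
is_data <- grepl("^[-0-9.]", x)  # assumption: data lines start with a number

# Carry the most recent ID label forward onto every following line
id_label <- c(NA, sub("^ID\\s+", "", x[is_id]))[cumsum(is_id) + 1]

# Carry the most recent section marker ("ID" or "SUB") forward the same way
marker  <- ifelse(is_id, "ID", ifelse(is_sub, "SUB", NA_character_))
section <- c(NA, marker[!is.na(marker)])[cumsum(!is.na(marker)) + 1]

# Parse the selected data lines and attach the header and ID columns
make_df <- function(keep) {
  vals <- read.table(text = x[keep], header = FALSE)
  data.frame(Header = header, ID = id_label[keep], vals,
             stringsAsFactors = FALSE)
}

df1 <- make_df(is_data & section %in% "ID")   # rows under each ID line
df2 <- make_df(is_data & section %in% "SUB")  # rows under each SUB line
```

The value columns come out with read.table's default names (V1, V2, ...), so rename them as needed. If the real file is strictly fixed-width with blank fields, read.fwf() with explicit widths may be a safer parser than the whitespace splitting used here.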
Upvotes: 2