ShanZhengYang
ShanZhengYang

Reputation: 17631

How to parse tab-delimited data (of different formats) into a data.table/data.frame?

I am trying to parse tab-delimited data, which has been saved as a text file with extraneous data. I would like this to be an R data.table/data.frame.

The tab-delimited format is the following:

A   1092    -   1093    +   1X
B   1093    HRDCPMRFYT
A   1093    +   1094    -   1X
B   1094    BSZSDFJRVF
A   1094    +   1095    +   1X
B   1095    SSTFCLEPVV
...

There are only two types of rows, A and B. A consistently has 5 columns, e.g. for the first row,

1092    -   1093    +   1X

B consistently has two columns:

1093    HRDCPMRFYT

Question: How do you parse a file with "alternating" rows with different formats?

Let's say that this was a text file which was only of this format, alternating rows of A and B, with 5 columns and 2 columns respectively. How do you parse this into an R data.table? My idea how be to create the following format:

1092    -    1093    +    1X    1093    HRDCPMRFYT
1093    +    1094    -    1X    1094    BSZSDFJRVF
1094    +    1095    +    1X    1095    SSTFCLEPVV
... 

Upvotes: 4

Views: 248

Answers (2)

David Arenburg
David Arenburg

Reputation: 92282

You can run shell commands using fread. In Win10, you can even run some linux utilities such as sed

Hence, you can simply do

fread("sed '$!N;s/\\n/ /' test.tab")
#      V1 V2   V3 V4      V5         V6
# 1: 1092  - 1093  + 1X 1093 HRDCPMRFYT
# 2: 1093  + 1094  - 1X 1094 BSZSDFJRVF
# 3: 1094  + 1095  + 1X 1095 SSTFCLEPVV

(sed syntax taken from here)


Data

text <- "1092    -   1093    +   1X
1093    HRDCPMRFYT
1093    +   1094    -   1X
1094    BSZSDFJRVF
1094    +   1095    +   1X
1095    SSTFCLEPVV"

# Saving it as tab separated file on disk
write(gsub(" +", "\t", text), file = "test.tab")

Upvotes: 5

user20650
user20650

Reputation: 25854

One way to go is to read in your data with readLines, pull out the bits you want, and pass to read.table to form the dataframe. So if the rows are alternating then:

txt <- 
'1092    -   1093    +   1X
1093    HRDCPMRFYT
1093    +   1094    -   1X
1094    BSZSDFJRVF
1094    +   1095    +   1X
1095    SSTFCLEPVV'


rd <- readLines(textConnection(txt))
data.frame(read.table(text=rd[c(TRUE, FALSE)]), 
           read.table(text=rd[c(FALSE, TRUE)]))

Change textConnection(txt) to your file path


Another way is to read in only once and then post-process

r <- read.table(text=txt, fill=TRUE, stringsAsFactors=FALSE, na.strings = "")
d <- cbind(r[c(TRUE, FALSE),], r[c(FALSE, TRUE),])
d[ colSums(is.na(d)) < nrow(d)]

Upvotes: 4

Related Questions