I would like to read a .txt file with the following structure: ID Chr Allele:Effect ... Q1 1 1:-0.133302 2: 0.007090 Q2 1 1:-0.050089 2: 0.021212 Q3 1 1: 0.045517 2:-0.038001 The problem is that the field separator is a variable number of whitespaces, and I need to get rid of the leading numbers in the second and third column. Finally, the result should look like: qtl_id chr eff_1 eff_2 Q1 1 -0.133302 0.007090 Q2 1 -0.050089 0.021212 Q3 1 0.045517 -0.038001 edit head(read.table(file = fpath, sep = "", header = TRUE)) yields ID Chr Allele.Effect ... Q1 1 1:-0.133302 2: 0.007090 Q2 1 1:-0.050089 2: 0.021212 Q3 1 1: 0.045517 2:-0.038001 Q4 1 1: 0.018582 2:-0.041846 Q5 1 1:-0.146560 2: 0.005473 Q6 1 1:-0.048240 2: 0.069418

Reputation: 2194

Read in table from text file with variable number of whitespaces as separator

I would like to read a .txt file with the following structure:

ID      Chr   Allele:Effect ...
Q1        1     1:-0.133302   2: 0.007090
Q2        1     1:-0.050089   2: 0.021212
Q3        1     1: 0.045517   2:-0.038001

The problem is that

the field separator is a variable number of whitespaces, and
I need to get rid of the leading numbers in the second and third column.

Finally, the result should look like:

qtl_id    chr   eff_1         eff_2
Q1        1     -0.133302     0.007090
Q2        1     -0.050089     0.021212
Q3        1      0.045517     -0.038001

edit

head(read.table(file = fpath, sep = "", header = TRUE)) yields

ID         Chr Allele.Effect         ...
Q1  1 1:-0.133302            2:    0.007090
Q2  1 1:-0.050089            2:    0.021212
Q3  1          1:      0.045517 2:-0.038001
Q4  1          1:      0.018582 2:-0.041846
Q5  1 1:-0.146560            2:    0.005473
Q6  1 1:-0.048240            2:    0.069418

Upvotes: 1

Answers (3)

mra68

Reputation: 2960

I applied a brute force method. First the lines are read by 'readLines' without separation. 'gsub' removes the 'number:'-pieces, and 'strsplit' splits each line. The intermediate result is the list 'A'. 'A[[1]]' is the vector of column names, 'A[[2]]' represents the first row of the data frame, and so on. Then we can piece together the data frame 'df' from the components of 'A'.

A <- lapply( readLines(filename),
             function(x)
             {
               strsplit( gsub( pattern="[0-9]+[ ]*[:][ ]*",
                               replacement="",
                               x=as.character(x)                        ),
                         "[ ]+")
             } )

df <- data.frame()
for (n in 2:length(A)) { df <- rbind(df,t(unlist(A[[n]]))) }
colnames(df) = unlist(A[[1]])

Unfortunately all entries of 'df' are still factors:

> df
  ID Chr Allele:Effect       ...
1 Q1   1     -0.133302  0.007090
2 Q2   1     -0.050089  0.021212
3 Q3   1      0.045517 -0.038001
> df$"Allele:Effect"
[1] -0.133302 -0.050089 0.045517 
Levels: -0.133302 -0.050089 0.045517

We can change this as follows:

for (m in 1:ncol(df))
{
  n <- as.numeric(df[,m])
  v <- suppressWarnings( as.numeric(levels(df[,m])[n]) )

  if (!any(is.na(v))) { df[,m] <- v }
}

Now the numbers are numbers.

Upvotes: 1

Mike Wise

Reputation: 22817

Use read.table with sep="". It is designed to handle this case. In the following example I saved your file to "qt.csv" and read it in that way. Worked fine.

> read.table("qt.csv",sep="",header=T)
1 qtl_id chr     eff_1     eff_2
2     Q1   1 -0.133302  0.007090
3     Q2   1 -0.050089  0.021212
4     Q3   1  0.045517 -0.038001

Update:

To answer the new request (as I understand it), I read in the other file now, and deleted those pesky characters.

f <- readLines("allele2.csv")
f <- gsub("(\\d)\\: ","\\1",f)   # get rid of the spaces after the colon

df <- read.table(textConnection(f),sep="",header=T)

df[[2]] <- as.numeric(gsub("^\\d\\:","",df[[2]]))
df[[3]] <- as.numeric(gsub("^\\d\\:","",df[[3]]))
df[[4]] <- as.numeric(gsub("^\\d\\:","",df[[4]]))
df

yielding

  ID Chr Allele.Effect       ...
1 Q1   1     -0.133302 20.007090
2 Q2   1     -0.050089 20.021212
3 Q3   1     10.045517 -0.038001

Upvotes: 1

Ken Benoit

Reputation: 14902

Assuming the data has the regular structure you have provided, then reading it as fixed columns should work. Here I suggest you prepare the file first by saving the file without the header, and replacing the textConnection() argument to read.fwf() below with your filename.

Note that by setting the neffects value for your number of Allele:Effect columns, this will work on as many as you have in your file (in the example, 2).

myfile <- 
"Q1        1     1:-0.133302   2: 0.007090
Q2        1     1:-0.050089   2: 0.021212
Q3        1     1: 0.045517   2:-0.038001"

# makes the solution general enough for any number of effects columns
neffects <- 2  # in example, two

# set up column names
cnames <- c("ID", "Chr", as.vector(outer(c("junk", "Effect"), 1:neffects, paste, sep = ".")))

# read the data as a fixed format
mydata <- read.fwf(textConnection(myfile), c(2, 11, rep(c(5, 9), neffects)), 
                   col.names = cnames,
                   colClasses = c("character", "integer", rep(c("character", "numeric"), neffects)),
                   stringsAsFactors = FALSE)
# get rid of unwanted columns
mydata <- mydata[, -grep("^junk", colnames(mydata))]

mydata
##   ID Chr  Effect.1  Effect.2
## 1 Q1   1 -0.133302  0.007090
## 2 Q2   1 -0.050089  0.021212
## 3 Q3   1  0.045517 -0.038001

Upvotes: 1

Read in table from text file with variable number of whitespaces as separator

Answers (3)

Related Questions