Reputation: 2194
I would like to read a .txt file with the following structure:
ID Chr Allele:Effect ...
Q1 1 1:-0.133302 2: 0.007090
Q2 1 1:-0.050089 2: 0.021212
Q3 1 1: 0.045517 2:-0.038001
The problem is that
Finally, the result should look like:
qtl_id chr eff_1 eff_2
Q1 1 -0.133302 0.007090
Q2 1 -0.050089 0.021212
Q3 1 0.045517 -0.038001
edit
head(read.table(file = fpath, sep = "", header = TRUE))
yields
ID Chr Allele.Effect ...
Q1 1 1:-0.133302 2: 0.007090
Q2 1 1:-0.050089 2: 0.021212
Q3 1 1: 0.045517 2:-0.038001
Q4 1 1: 0.018582 2:-0.041846
Q5 1 1:-0.146560 2: 0.005473
Q6 1 1:-0.048240 2: 0.069418
Upvotes: 1
Views: 102
Reputation: 2960
I applied a brute force method. First the lines are read by 'readLines' without separation. 'gsub' removes the 'number:'-pieces, and 'strsplit' splits each line. The intermediate result is the list 'A'. 'A[[1]]' is the vector of column names, 'A[[2]]' represents the first row of the data frame, and so on. Then we can piece together the data frame 'df' from the components of 'A'.
A <- lapply( readLines(filename),
function(x)
{
strsplit( gsub( pattern="[0-9]+[ ]*[:][ ]*",
replacement="",
x=as.character(x) ),
"[ ]+")
} )
df <- data.frame()
for (n in 2:length(A)) { df <- rbind(df,t(unlist(A[[n]]))) }
colnames(df) = unlist(A[[1]])
Unfortunately all entries of 'df' are still factors:
> df
ID Chr Allele:Effect ...
1 Q1 1 -0.133302 0.007090
2 Q2 1 -0.050089 0.021212
3 Q3 1 0.045517 -0.038001
> df$"Allele:Effect"
[1] -0.133302 -0.050089 0.045517
Levels: -0.133302 -0.050089 0.045517
We can change this as follows:
for (m in 1:ncol(df))
{
n <- as.numeric(df[,m])
v <- suppressWarnings( as.numeric(levels(df[,m])[n]) )
if (!any(is.na(v))) { df[,m] <- v }
}
Now the numbers are numbers.
Upvotes: 1
Reputation: 22817
Use read.table with sep="". It is designed to handle this case. In the following example I saved your file to "qt.csv" and read it in that way. Worked fine.
> read.table("qt.csv",sep="",header=T)
1 qtl_id chr eff_1 eff_2
2 Q1 1 -0.133302 0.007090
3 Q2 1 -0.050089 0.021212
4 Q3 1 0.045517 -0.038001
Update:
To answer the new request (as I understand it), I read in the other file now, and deleted those pesky characters.
f <- readLines("allele2.csv")
f <- gsub("(\\d)\\: ","\\1",f) # get rid of the spaces after the colon
df <- read.table(textConnection(f),sep="",header=T)
df[[2]] <- as.numeric(gsub("^\\d\\:","",df[[2]]))
df[[3]] <- as.numeric(gsub("^\\d\\:","",df[[3]]))
df[[4]] <- as.numeric(gsub("^\\d\\:","",df[[4]]))
df
yielding
ID Chr Allele.Effect ...
1 Q1 1 -0.133302 20.007090
2 Q2 1 -0.050089 20.021212
3 Q3 1 10.045517 -0.038001
Upvotes: 1
Reputation: 14902
Assuming the data has the regular structure you have provided, then reading it as fixed columns should work. Here I suggest you prepare the file first by saving the file without the header, and replacing the textConnection()
argument to read.fwf()
below with your filename.
Note that by setting the neffects
value for your number of Allele:Effect columns, this will work on as many as you have in your file (in the example, 2).
myfile <-
"Q1 1 1:-0.133302 2: 0.007090
Q2 1 1:-0.050089 2: 0.021212
Q3 1 1: 0.045517 2:-0.038001"
# makes the solution general enough for any number of effects columns
neffects <- 2 # in example, two
# set up column names
cnames <- c("ID", "Chr", as.vector(outer(c("junk", "Effect"), 1:neffects, paste, sep = ".")))
# read the data as a fixed format
mydata <- read.fwf(textConnection(myfile), c(2, 11, rep(c(5, 9), neffects)),
col.names = cnames,
colClasses = c("character", "integer", rep(c("character", "numeric"), neffects)),
stringsAsFactors = FALSE)
# get rid of unwanted columns
mydata <- mydata[, -grep("^junk", colnames(mydata))]
mydata
## ID Chr Effect.1 Effect.2
## 1 Q1 1 -0.133302 0.007090
## 2 Q2 1 -0.050089 0.021212
## 3 Q3 1 0.045517 -0.038001
Upvotes: 1