Reputation: 542
I have a tab-delimited file that looks like this:
"ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
I use this code to read in the data:
df <- read.table("path/to/file",header=TRUE,fill=TRUE)
The result is this:
df
  ID V1 V2 V3 V4 V5
1  1  A  1 NA NA NA
2  2  B  2 NA NA NA
But I expect this:
df
  ID V1 V2 V3 V4 V5
1  1  A NA NA NA  1
2  2  B NA NA NA  2
I've tried sep="\t" and na.strings=c(""," ",NULL), but those don't help.
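For reference, the problem can be reproduced without the file by passing the same string via text= (a minimal sketch):
s <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
df <- read.table(text = s, header = TRUE, fill = TRUE)
df   # gives the shifted result shown above: the default whitespace separator collapses the empty fields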
Upvotes: 2
Views: 124
Reputation: 185
As Quar already mentioned in their comment, your file has an extra tab at the beginning of every data line, so the number of column labels does not match the number of data fields:
> foo <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
> cat(foo, "\n")
ID      V1      V2      V3      V4      V5
        1       A                               1
        2       B                               2
That would be OK if the additional first column contained unique row names. So there are two ways to address the problem: 1. remove the empty column (ideally by fixing the process that produced the file, or by stripping it before parsing, as sketched below), or 2. fix the row name issue.
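If fixing the file upstream isn't possible, the first option can also be done in R by removing the leading tab from each data line before parsing. A minimal sketch using the foo string from above (for a real file you would readLines() the path instead):
> fixed <- sub("^\t", "", readLines(textConnection(foo)))
> read.delim(text = paste(fixed, collapse = "\n"))
  ID V1 V2 V3 V4 V5
1  1  A NA NA NA  1
2  2  B NA NA NA  2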
Here is my suggestion using the second option:
As the data is tab separated, I'd use read.delim, which is just read.table with reasonable defaults for this kind of file. Of course that throws an error when used without some tweaking ("duplicate 'row.names' are not allowed"). To fix that, we need to tell it to use automatic row numbering. That way you get almost exactly what you want:
> read.delim(text=foo, row.names=NULL)
row.names ID V1 V2 V3 V4 V5
1 1 A NA NA NA 1
2 2 B NA NA NA 2
All that's left to do is get rid of the row.names column (see the sketch after the next example). Alternatively, you may want the ID column to be turned into row names:
> read.delim(text=foo, row.names='ID')
row.names V1 V2 V3 V4 V5
1 A NA NA NA 1
2 B NA NA NA 2
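Or, if you would rather keep ID as a regular column and the automatic row numbers, something like this should do it:
> df <- read.delim(text=foo, row.names=NULL)
> df$row.names <- NULL   # drop the extra column; ID, V1..V5 remain with automatic row numbers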
Hope that helps.
Upvotes: 1
Reputation: 27792
data.table's fread() had no problem reading in the string... but your data seems to have one \t too many (after each \n), which causes the creation of an extra column.
It is probably best practice to fix this in the export that creates your files.
If this is not possible, you can adjust fread()'s arguments to get the desired output. Here we use drop to delete the first column that was created due to the extra \t. To get the right column names back, we read the first line of the string again:
string <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"

data.table::fread(string,
                  drop = 1,
                  fill = TRUE,
                  col.names = as.matrix(data.table::fread(string, nrows = 1, header = FALSE))[1, ])
   ID V1 V2 V3 V4 V5
1:  1  A NA NA NA  1
2:  2  B NA NA NA  2
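If you want to read straight from the file instead of a string, the same arguments should work with the path from the question (an untested sketch):
data.table::fread("path/to/file",
                  drop = 1,
                  fill = TRUE,
                  col.names = as.matrix(data.table::fread("path/to/file", nrows = 1, header = FALSE))[1, ])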
Upvotes: 1
Reputation: 50738
I can't get it to work with read.table, so how about parsing the string the manual way:
ss <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
library(tidyverse)
entries <- unlist(str_split(ss, "\t"))
ncol <- str_which(entries, "\n")[1]
entries %>%
    str_remove("\\n") %>%
    matrix(ncol = ncol, byrow = TRUE, dimnames = list(NULL, .[1:ncol])) %>%
    as.data.frame() %>%
    slice(-1) %>%
    mutate_if(is.factor, as.character) %>%
    mutate_all(parse_guess)
# ID V1 V2 V3 V4 V5
#1 1 A NA NA NA 1
#2 2 B NA NA NA 2
Explanation: We split the string on "\t"; the position of the first entry containing "\n" tells us how many columns we have. We then tidy up the entries by removing the line-break characters, reshape as matrix and then as data.frame, fix the header, and let readr::parse_guess figure out the data type of every column.
For good measure we can roll everything into a function:
read.my.data <- function(s) {
    entries <- unlist(str_split(s, "\t"))
    ncol <- str_which(entries, "\n")[1]
    entries %>%
        str_remove("\\n") %>%
        matrix(ncol = ncol, byrow = TRUE, dimnames = list(NULL, .[1:ncol])) %>%
        as.data.frame() %>%
        slice(-1) %>%
        mutate_if(is.factor, as.character) %>%
        mutate_all(parse_guess)
}
and confirm
read.my.data(ss)
# ID V1 V2 V3 V4 V5
#1 1 A NA NA NA 1
#2 2 B NA NA NA 2
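If your data lives in a file rather than a string, you could rebuild the string from the file and reuse the same function (a sketch using the path from the question):
read.my.data(paste(readLines("path/to/file"), collapse = "\n"))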
Upvotes: 1