TJ87
TJ87

Reputation: 542

How to read data with many blank fields in R

I have a tab-delimited file that looks like this:

"ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"

I use this code to read in the data:

df <- read.table("path/to/file",header=TRUE,fill=TRUE)

The result is this:

df
  id V1 V2 V3 V4 V5
1  1  A  1 NA NA NA
2  2  B  2 NA NA NA

But I expect this:

df
  id V1 V2 V3 V4 V5
1  1  A  NA NA NA 1
2  2  B  NA NA NA 2

I've tried sep="\t" and na.strings=c(""," ",NULL) but those don't help.

Upvotes: 2

Views: 124

Answers (3)

Phil
Phil

Reputation: 185

As Quar already mentioned in his/her comment, your file has an extra tab in the beginning of every line, so the number of column labels does not match the number of data fields:

> foo <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
> cat(foo, "\n")
ID      V1      V2      V3      V4      V5
        1       A                               1
        2       B                               2 

That would be ok if the additional first column contained unique row names. So there are two ways to address the problem: 1. remove the empty column (ideally by fixing the process that produced that file) or 2. fix the row name issue.

Here is my suggestion using the second option:

As the data is tab separated, I'd use read.delim which is just read table with reasonable defaults for this kind of file. Of course that throws an error when used w/o some tweaking ("duplicate 'row.names' are not allowed"). To fix that, we need to tell it to use automatic row numbering. That way you get almost exactly what you want:

> read.delim(text=foo, row.names=NULL)
  row.names ID V1 V2 V3 V4 V5
1            1  A NA NA NA  1
2            2  B NA NA NA  2

All that's left to do is get rid of the row.names column. Alternatively, you may want the ID column to be turned into row.names:

> read.delim(text=foo, row.names='ID')
  row.names V1 V2 V3 V4 V5
1            A NA NA NA  1
2            B NA NA NA  2

Hope that helps.

Upvotes: 1

Wimpel
Wimpel

Reputation: 27792

data.table's fread() had no problem reading in the string... but your data seems to have a \t too many (after each \n), which causes the creation of an extra column.

It is probably best practive to fix this in your export that creates your files.

If this is not possible, you can adjust fread()'s arguments to get the desired output.

Here we use drop do delete the first column that was created due to the the extra \t.
To get the right column-names back, we read the first line of the file again

string <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
data.table::fread( string, 
                   drop = 1, 
                   fill = TRUE, 
                   col.names = as.matrix( fread(string, nrows = 1, header = FALSE))[1,] )


   ID V1 V2 V3 V4 V5
1:  1  A NA NA NA  1
2:  2  B NA NA NA  2

Upvotes: 1

Maurits Evers
Maurits Evers

Reputation: 50738

I can't get it to work with read.table, so how about parsing the string the manual way

ss <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"

library(tidyverse)

entries <- unlist(str_split(ss, "\t"))
ncol <- str_which(entries, "\n")[1]
entries %>%
    str_remove("\\n") %>%
    matrix(ncol = ncol, byrow = T, dimnames = list(NULL, .[1:ncol])) %>%
    as.data.frame() %>%
    slice(-1) %>%
    mutate_if(is.factor, as.character) %>%
    mutate_all(parse_guess)
#  ID V1 V2 V3 V4 V5
#1  1  A NA NA NA  1
#2  2  B NA NA NA  2

Explanation: We split the string on "\t"; the first occurrence of "\n" tells us how many columns we have. We then tidy up the entries by removing the line break characters "\n", reshape as matrix and then as data.frame, fix the header, and let readr::parse_guess guess the data type of every column.

For good measure we can roll everything into a function

read.my.data <- function(s) {
    entries <- unlist(str_split(s, "\t"))
    ncol <- str_which(entries, "\n")[1]
    entries %>%
        str_remove("\\n") %>%
        matrix(ncol = ncol, byrow = T, dimnames = list(NULL, .[1:ncol])) %>%
        as.data.frame() %>%
        slice(-1) %>%
        mutate_if(is.factor, as.character) %>%
        mutate_all(parse_guess)
}

and confirm

read.my.data(ss)
#  ID V1 V2 V3 V4 V5
#1  1  A NA NA NA  1
#2  2  B NA NA NA  2

Upvotes: 1

Related Questions