Reputation: 1433
I've got a data file A with 7 columns, no missing values, to which I've unix-joined a data file B that has 28 fields. The result file is C. If no match is found in B, then the output row in C only has 7 columns. If there is a match in B, then the output row in C has 35 columns. I've kicked around join's -e option to fill the 28 missing fields, but without success.
What I'm trying to do is duplicate SAS's MISSOVER input statement in R. For example, the following code works perfectly:
dat <- textConnection('x1,x2,x3,x4
1,2,"present","present"
3,4
5,6')
df <- read.csv(dat, sep=',' , header=T ,
colClasses = c("numeric" , "numeric", "character", "character"))
> df
x1 x2 x3 x4
1 1 2 present present
2 3 4
3 5 6
But when I try to load my C file, I get the following error (using TRUE instead of T):
df <- read.table( 'C.tab' , header=T , sep='\t', fill=TRUE,
colClasses = c(rep('numeric',7),rep('character',28)))
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 35 elements
The first line (the second row in C, after the header) does indeed have only those 7 fields from A. In SAS I'd use the MISSOVER statement to set all those trailing missing fields to some missing value. How can I do that in R? Thanks.
Upvotes: 0
Views: 316
Reputation: 263411
The fill=TRUE setting in the parameters of read.table (or its derivative cousin read.csv) is probably what you are looking for.
df <- read.table(dat, sep=',' , header=T , fill=TRUE,
colClasses = c("numeric" , "numeric", "character", "character"))
df
#
x1 x2 x3 x4
1 1 2 present present
2 3 4
3 5 6
The default for fill is TRUE for read.csv, but your error says you used fill=T, suggesting that you have an object named T in your workspace. The default for read.table is fill = !blank.lines.skip, and since the default is also blank.lines.skip = TRUE, the usual default for fill in read.table is FALSE.
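The differing defaults are visible in the function signatures themselves; a quick check, with no files involved:

```r
# read.csv hard-codes fill = TRUE; read.table defers to !blank.lines.skip
formals(read.csv)$fill     # TRUE
formals(read.table)$fill   # the unevaluated expression !blank.lines.skip
```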
Your edited question suggests you have other problems in your character fields. The usual suspects are unmatched quotes or octothorpes (#), which are effectively line terminators, so try this instead:
df <- read.table( 'C.tab' , header=T , sep='\t', fill=TRUE,
quote="",
comment.char="",
colClasses = c(rep('numeric',7),rep('character',28)))
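As a sanity check, here is a minimal self-contained version of that call on a ragged tab-separated sample (a stand-in for C.tab, with 1 numeric and 2 character columns instead of 7 and 28):

```r
# Two data rows: the second is short, as with unmatched rows from the join
txt <- "a\tb\tc\n1\tx\ty\n2"
df <- read.table(text = txt, header = TRUE, sep = "\t", fill = TRUE,
                 quote = "", comment.char = "",
                 colClasses = c("numeric", "character", "character"))
df   # the short row is padded out rather than raising an error
```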
If you are having difficulty with errors related to varying numbers of items per line, it can be very useful to use count.fields. It accepts parameters similar to those of read.table, though it has no header argument, so use skip=1 to step over a header line. If you have a large number of input lines it can be useful to wrap the call to count.fields in a table call:
length_tbl <- table( count.fields( 'C.tab' , sep='\t', skip=1,
                                   quote="",
                                   comment.char="")
                   )
You can then experiment with different options. Once you know what you are looking for, you can also identify the line numbers that are causing problems by wrapping a which call around count.fields:
bad_lines <- which( count.fields( 'C.tab' , sep='\t', skip=1,
                                  quote="",
                                  comment.char="")
                    != 7 # or whatever is the "correct" length
                  )
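A minimal end-to-end sketch of that diagnostic on throwaway data (the temp file and the "correct" count of 3 are just for illustration):

```r
# Write a tiny tab-separated file whose second data row is short
tmp <- tempfile(fileext = ".tab")
writeLines(c("h1\th2\th3", "1\ta\tb", "2"), tmp)
# Count fields per data line, skipping the header
n <- count.fields(tmp, sep = "\t", quote = "", comment.char = "", skip = 1)
bad <- which(n != 3)       # data-line positions with the wrong field count
readLines(tmp)[bad + 1]    # +1 to step over the skipped header line
```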
Upvotes: 3