(This was posted previously at the data-table-help mailing list, but it's been a few weeks without comment, and I did a little more to try to debug it.)
I ran into a strange error that an internet search turns up only in the commit log of data.table:
# Error in dcast.data.table(test.table, as.formula(paste(class.col, "+", :
# retFirst must be integer vector the same length as nrow(i)
This came up when running a previously tested, working dcast.data.table expression on a data.table that I had subsetted by randomly resampling Trial with replacement. The offending section is this:
dcast.data.table(test.table,
                 Class + Time + Trial ~ Channel,
                 value.var = "Voltage",
                 fun.aggregate = identity)
It seems to be choking on near-duplicate rows in the input table (i.e., the error is the same with or without the id column present in the table):
test.table <- structure(list(Trial = c(1169L, 1169L), Sample = c(155L, 155L
), Class = c(1L, 1L), Subject = structure(c(13L, 13L), .Label = c("s01",
"s02", "s03", "s04", "s05", "s06", "s07", "s08", "s09", "s10",
"s11", "s12", "s13"), class = "factor"), Channel = c(1L, 1L),
Voltage = structure(c(-0.992322316444497, -0.992322316444497
), "`scaled:center`" = -6.23438399446429e-16, "`scaled:scale`" = 1),
Time = c(201.149466192171, 201.149466192171), Baseline = c(0.688151312347969,
0.688151312347969), id = 1:2), .Names = c("Trial", "Sample",
"Class", "Subject", "Channel", "Voltage", "Time", "Baseline",
"id"), class = c("data.table", "data.frame"), row.names = c(NA,
-2L), sorted = "id")
test.table
#    Trial Sample Class Subject Channel    Voltage     Time  Baseline id
# 1:  1169    155     1     s13       1 -0.9923223 201.1495 0.6881513  1
# 2:  1169    155     1     s13       1 -0.9923223 201.1495 0.6881513  2
dcast.data.table(test.table,
                 Class + Time + Trial ~ Channel,
                 value.var = "Voltage",
                 fun.aggregate = identity)
# Error in dcast.data.table(test.table, Class + Time + Trial ~ Channel, :
# retFirst must be integer vector the same length as nrow(i)
Changing the Trial value in one of the two rows gets close to the output I am looking for:
test.table[2, Trial := 1170L]  # Trial is an integer column, so assign an integer
dcast.data.table(test.table,
                 Class + Time + Trial ~ Channel,
                 value.var = "Voltage",
                 fun.aggregate = identity)
#    Class     Time Trial          1
# 1:     1 201.1495  1169 -0.9923223
# 2:     1 201.1495  1170 -0.9923223
What's bothering data.table? I tried changing keys and messing with the order of the formula terms just to see, because I don't understand the error, but that didn't work.
If I replace the function call with regular dcast from reshape2, I get a seemingly unrelated error:
# Error in vapply(indices, fun, .default) : values must be length 0, but FUN(X[[29]]) result is length 1
At this point in my code I don't care if the Trial values are correct, so I could work around this by replacing it in the formula with id, but I'm interested in a more general or robust solution.
Reputation: 118889
dcast.data.table provides better error message when fun.aggregate is specified but it returns length != 1. Closes git #693. Thanks to Trevor Alexander for reporting here on SO.
I agree that the error message should be more helpful in understanding the issue and it usually is in data.table. This is just a case I hadn't foreseen.
If you could please file the issue here as a bug, I'll fix it when I've some time.
Your problem, however, seems like a fairly trivial RTFM case to me. From ?dcast.data.table:
fun.aggregate - Should the data be aggregated before casting? If the formula doesn't identify single observation for each cell, then aggregation defaults to length with a message.
In the DETAILS section: "... fun.aggregate will have to be used. The aggregating function should take a vector as input and return a single value (or a list of length one) as output." ...
In your example, your formula's LHS results in two identical rows, which means fun.aggregate has to be used; it defaults to length if you didn't specify one (as reshape2:::dcast does). And you've used identity, which just returns the values unchanged. So it returns both values of Voltage for that group, which the function doesn't like.
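To see which cells force aggregation before casting, one option is to count rows per combination of the formula's variables. This is only a sketch on a toy table that mirrors the question's two rows (values shortened); the names dt, cell.counts, dup.cells and wide are illustrative, and mean is used purely as an example of a function that returns a single value.

```r
library(data.table)

# Toy table mirroring the question's two duplicated rows (values shortened)
dt <- data.table(
  Class   = c(1L, 1L),
  Time    = c(201.1495, 201.1495),
  Trial   = c(1169L, 1169L),
  Channel = c(1L, 1L),
  Voltage = c(-0.9923223, -0.9923223)
)

# Count rows per cell of the cast: every combination of formula variables
cell.counts <- dt[, .N, by = .(Class, Time, Trial, Channel)]

# Cells with more than one row are the ones that need aggregating
dup.cells <- cell.counts[N > 1]
dup.cells

# A function that returns one value per cell casts without error
wide <- dcast.data.table(dt,
                         Class + Time + Trial ~ Channel,
                         value.var = "Voltage",
                         fun.aggregate = mean)
wide
```

If dup.cells is empty, no aggregation is needed and fun.aggregate can be left out.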
The error message should be something like:
Error: fun.aggregate should return, for each unique group (from formula's LHS), a length 1 vector, but returns length=2 for a group.
Or something of that sort. Feel free to suggest better / clearer error messages.
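Until a clearer message is available, a check along these lines could be run before casting. This is a rough sketch, not data.table's implementation: check_aggregate is a hypothetical helper, and its arguments (by.cols, value.var, fun) are names made up for the illustration.

```r
library(data.table)

# Hypothetical helper (not part of data.table): apply `fun` to each cell and
# stop with a descriptive message if any result is not of length 1
check_aggregate <- function(dt, by.cols, value.var, fun) {
  res.len <- dt[, .(len = length(fun(get(value.var)))), by = by.cols]
  bad <- res.len[len != 1L]
  if (nrow(bad) > 0L) {
    stop(sprintf(
      "fun.aggregate should return a length 1 vector for each group, but returns length=%d for a group.",
      bad$len[1L]
    ))
  }
  invisible(TRUE)
}

# Toy table mirroring the question's two duplicated rows (values shortened)
dt <- data.table(
  Class   = c(1L, 1L),
  Time    = c(201.1495, 201.1495),
  Trial   = c(1169L, 1169L),
  Channel = c(1L, 1L),
  Voltage = c(-0.9923223, -0.9923223)
)

cell.cols <- c("Class", "Time", "Trial", "Channel")

# identity returns both values for the duplicated cell, so the check fails
msg <- tryCatch(check_aggregate(dt, cell.cols, "Voltage", identity),
                error = function(e) conditionMessage(e))
msg

# mean returns a single value per cell, so the check passes
ok <- check_aggregate(dt, cell.cols, "Voltage", mean)
```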
PS: I don't understand what you mean by near-duplicate.
identical(test.table[1, list(Class, Time, Trial)],
test.table[2, list(Class, Time, Trial)])
# [1] TRUE
If you use the id column in the formula (on the RHS in the call below), you should be able to get the desired result, as each cell is then uniquely identified...
dcast.data.table(test.table,
                 Class + Time + Trial ~ Channel + id,
                 value.var = "Voltage",
                 fun.aggregate = identity)
#    Class     Time Trial        1_1        1_2
# 1:     1 201.1495  1169 -0.9923223 -0.9923223
The function only considers the columns given in the formula LHS to find out if there are/aren't unique rows, not if your actual input data has unique rows (if that was the confusion).
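For data resampled with replacement, one possible general workaround is to number the repeats within each cell and add that index to the LHS, so that every row is identified uniquely and no aggregation is needed. This is a sketch under assumptions, not something taken from the documentation: dup.idx is a made-up column name, and the toy table stands in for the question's data.

```r
library(data.table)

# Toy table mirroring the question's two duplicated rows (values shortened)
dt <- data.table(
  Class   = c(1L, 1L),
  Time    = c(201.1495, 201.1495),
  Trial   = c(1169L, 1169L),
  Channel = c(1L, 1L),
  Voltage = c(-0.9923223, -0.9923223)
)

# dup.idx (hypothetical name) numbers the repeated rows within each cell
dt[, dup.idx := seq_len(.N), by = .(Class, Time, Trial, Channel)]

# With dup.idx on the LHS each cell holds exactly one value,
# so fun.aggregate can be omitted
wide <- dcast.data.table(dt,
                         Class + Time + Trial + dup.idx ~ Channel,
                         value.var = "Voltage")
wide
```

The result keeps one row per resampled copy of a trial, and dup.idx can be dropped afterwards if it is not needed.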
To answer OP's 2nd comment:
The only way currently to get a result (without error) is if your function returns a list:
dcast.data.table(test.table,
                 Class + Time + Trial ~ Channel,
                 value.var = "Voltage",
                 fun.aggregate = list)
#    Class     Time Trial                     1
# 1:     1 201.1495  1169 -0.9923223,-0.9923223
Then you can just check whether every element of each resulting list column has length 1 and, if so, unlist that column.
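The check-and-unlist step could look roughly like this. It is a sketch on a toy table in which the two rows differ in Trial, so every cell holds a single value; the names dt, wide and list.cols are illustrative.

```r
library(data.table)

# Toy table: the two rows differ in Trial, so each cell has one value
dt <- data.table(
  Class   = c(1L, 1L),
  Time    = c(201.1495, 201.1495),
  Trial   = c(1169L, 1170L),
  Channel = c(1L, 1L),
  Voltage = c(-0.9923223, -0.9923223)
)

wide <- dcast.data.table(dt,
                         Class + Time + Trial ~ Channel,
                         value.var = "Voltage",
                         fun.aggregate = list)

# Find the list columns produced by fun.aggregate = list
list.cols <- names(wide)[vapply(wide, is.list, logical(1))]

for (col in list.cols) {
  # Flatten only when every cell holds exactly one value
  if (all(lengths(wide[[col]]) == 1L)) {
    set(wide, j = col, value = unlist(wide[[col]]))
  }
}
wide
```

A column in which some cell holds more than one value is left as a list column, which makes the remaining duplicates easy to spot.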