Reputation: 331
I have a large dataframe, 305k rows, with a two keys and a data column as follows:
I am trying to convert this to a sparse matrix using the following code in R:
#convert to factors
data$RID = as.factor(data$RID)
data$HID = as.factor(data$HID)
data$VALUE = as.numeric(data$VALUE)
str(data)
#remove nas
data = na.omit(data)
#create sparse matrix
X = with(data,sparseMatrix(i=RID,
j=HID,
x=VALUE,
dimnames=list(levels(RID), levels(HID))))
Which is producing the following error message:
Error in sparseMatrix(i = RID, j = HID, x = VALUE, dimnames = list(levels(RID), :
NA's in (i,j) are not allowed
In addition: Warning messages:
1: In Ops.factor(i, !(m.i || i1)) : ‘+’ not meaningful for factors
2: In Ops.factor(j, !(m.j || i1)) : ‘+’ not meaningful for factors
I have removed NAs so i'm unsure why the error-NAs is appearing? It also has reference to '+'s within the factors but i have checked all 36k factors and there are no '+'s there?
Does anyone know what the solution is?
I have included a snapshot of the first 20 rows of data below so you can re-produce the issue:
"RID" "HID" "VALUE"
"361838" "620631" 76.55
"361838" "620671" 82.61
"361838" "620787" 57.73
"361838" "621146" 58.65
"361838" "637825" 64.15
"361838" "637859" 82.79
"361838" "641254" 50.38
"361838" "642105" 72.88
"361838" "646469" 45.79
"361838" "648400" 82.06
"395855" "301340" -5.12
"395855" "649304" 41.88
"395855" "650324" -30.83
"395855" "657458" 46.47
"395855" "658028" -0.53
"395855" "659504" 28.84
"395855" "660506" 29.03
"395855" "660519" 14.16
"395855" "660521" -38.17
"395855" "660547" 35.45
Although when i look at the factors, i get the following:
> str(data)
'data.frame': 20 obs. of 3 variables:
$ RID : Factor w/ 30608 levels "361838","395855",..: 1 1 1 1 1 1 1 1 1 1 ...
$ HID : Factor w/ 37399 levels "2018","7990",..: 11604 11624 11709 11740 14031 14049 15086 15457 16821 17270 ...
$ VALUE: num 76.5 82.6 57.7 58.6 64.2 ...
Upvotes: 2
Views: 1609
Reputation: 13118
Try converting RID
and HID
to numeric in your call to sparseMatrix
:
X <- with(data, sparseMatrix(i=as.numeric(RID),
j=as.numeric(HID),
x=as.numeric(VALUE),
dimnames=list(levels(RID), levels(HID))))
The reason why RID
and HID
need to be converted into factors first, then into numeric in the call to sparseMatrix
is that otherwise sparseMatrix
will take the values of RID
and HID
as indices for the rows/columns. In other words,
test <- data.frame(x = 101:105, y = 201:205, v = 1:25)
dim(with(test, sparseMatrix(i = x, j = y, x = v)))
# [1] 105 205
gives you a 105 x 205 matrix, even though what we had in mind, treating x
and y
as keys, was just a 5 x 5 matrix.
Upvotes: 2