Tammboy
Tammboy

Reputation: 331

R - how to convert long-data dataframe to sparse matrix

I have a large dataframe, 305k rows, with a two keys and a data column as follows:

enter image description here

I am trying to convert this to a sparse matrix using the following code in R:

#convert to factors
data$RID   = as.factor(data$RID)
data$HID   = as.factor(data$HID)
data$VALUE = as.numeric(data$VALUE)
str(data)

#remove nas
data = na.omit(data)

#create sparse matrix
X = with(data,sparseMatrix(i=RID, 
                           j=HID, 
                           x=VALUE,
                           dimnames=list(levels(RID), levels(HID))))

Which is producing the following error message:

Error in sparseMatrix(i = RID, j = HID, x = VALUE, dimnames = list(levels(RID),  : 
  NA's in (i,j) are not allowed
In addition: Warning messages:
1: In Ops.factor(i, !(m.i || i1)) : ‘+’ not meaningful for factors
2: In Ops.factor(j, !(m.j || i1)) : ‘+’ not meaningful for factors

I have removed NAs so i'm unsure why the error-NAs is appearing? It also has reference to '+'s within the factors but i have checked all 36k factors and there are no '+'s there?

Does anyone know what the solution is?

I have included a snapshot of the first 20 rows of data below so you can re-produce the issue:

"RID" "HID" "VALUE"
"361838" "620631" 76.55
"361838" "620671" 82.61
"361838" "620787" 57.73
"361838" "621146" 58.65
"361838" "637825" 64.15
"361838" "637859" 82.79
"361838" "641254" 50.38
"361838" "642105" 72.88
"361838" "646469" 45.79
"361838" "648400" 82.06
"395855" "301340" -5.12
"395855" "649304" 41.88
"395855" "650324" -30.83
"395855" "657458" 46.47
"395855" "658028" -0.53
"395855" "659504" 28.84
"395855" "660506" 29.03
"395855" "660519" 14.16
"395855" "660521" -38.17
"395855" "660547" 35.45

Although when i look at the factors, i get the following:

> str(data)
'data.frame':   20 obs. of  3 variables:
 $ RID  : Factor w/ 30608 levels "361838","395855",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ HID  : Factor w/ 37399 levels "2018","7990",..: 11604 11624 11709 11740 14031 14049 15086 15457 16821 17270 ...
 $ VALUE: num  76.5 82.6 57.7 58.6 64.2 ...

Upvotes: 2

Views: 1609

Answers (1)

Weihuang Wong
Weihuang Wong

Reputation: 13118

Try converting RID and HID to numeric in your call to sparseMatrix:

X <- with(data, sparseMatrix(i=as.numeric(RID), 
                       j=as.numeric(HID), 
                       x=as.numeric(VALUE),
                       dimnames=list(levels(RID), levels(HID))))

The reason why RID and HID need to be converted into factors first, then into numeric in the call to sparseMatrix is that otherwise sparseMatrix will take the values of RID and HID as indices for the rows/columns. In other words,

test <- data.frame(x = 101:105, y = 201:205, v = 1:25)
dim(with(test, sparseMatrix(i = x, j = y, x = v)))
# [1] 105 205

gives you a 105 x 205 matrix, even though what we had in mind, treating x and y as keys, was just a 5 x 5 matrix.

Upvotes: 2

Related Questions