Reputation: 993
I am trying to turn an .rds file into a .feather file for reading with Pandas in Python.
library(feather)
# Set working directory
data = readRDS("file.rds")
data_year = data[["1986"]]
# Try 1
write_feather(
  data_year,
  "data_year.feather"
)
# Try 2
write_feather(
  as.data.frame(as.matrix(data_year)),
  "data_year.feather"
)
Try 1 returns Error: 'x' must be a data frame, and Try 2 actually writes a *.feather file, but that file is 4.5 GB for a single year whereas the original *.rds file is 0.055 GB for several years.
How can I turn the file into separate (or combined) *.feather files for each year whilst maintaining an adequate file size?
data looks like this: [screenshot omitted: a named list of dgCMatrix sparse matrices, one element per year]
data_year looks like this: [screenshot omitted: a single dgCMatrix sparse matrix]
Update:
I am open to any suggestions for making the data available for use in NumPy/Pandas whilst maintaining a modest file size!
Upvotes: 2
Views: 1174
Reputation: 11336
With scipy and rpy2, you can read each dgCMatrix object directly into Python as a scipy.sparse.csc_matrix object. Both use compressed sparse column (CSC) format, so there is actually zero need for preprocessing. All you need to do is pass the attributes of the dgCMatrix object as arguments to the csc_matrix constructor.
To test it out, I used R to create an RDS file storing a list of dgCMatrix objects:
library("Matrix")
set.seed(1L)
d <- 6L
n <- 10L
l <- replicate(n, sparseMatrix(i = sample(d), j = sample(d), x = sample(d), repr = "C"), simplify = FALSE)
names(l) <- as.character(seq(1986L, length.out = n))
l[["1986"]]
## 6 x 6 sparse Matrix of class "dgCMatrix"
##
## [1,] . . 5 . . .
## [2,] 3 . . . . .
## [3,] . . . . . 6
## [4,] . 2 . . . .
## [5,] . . . . 1 .
## [6,] . . . 4 . .
saveRDS(l, file = "list_of_dgCMatrix.rds")
Then, in Python:
from scipy import sparse
from rpy2 import robjects
readRDS = robjects.r['readRDS']
l = readRDS('list_of_dgCMatrix.rds')
x = l.rx2('1986') # in R: l[["1986"]]
x
## <rpy2.robjects.methods.RS4 object at 0x120db7b00> [RTYPES.S4SXP]
## R classes: ('dgCMatrix',)
print(x)
## 6 x 6 sparse Matrix of class "dgCMatrix"
##
## [1,] . . 5 . . .
## [2,] 3 . . . . .
## [3,] . . . . . 6
## [4,] . 2 . . . .
## [5,] . . . . 1 .
## [6,] . . . 4 . .
data = x.do_slot('x') # in R: x@x
indices = x.do_slot('i') # in R: x@i
indptr = x.do_slot('p') # in R: x@p
shape = x.do_slot('Dim') # in R: x@Dim or dim(x)
y = sparse.csc_matrix((data, indices, indptr), tuple(shape))
y
## <6x6 sparse matrix of type '<class 'numpy.float64'>'
## with 6 stored elements in Compressed Sparse Column format>
print(y)
## (1, 0) 3.0
## (3, 1) 2.0
## (0, 2) 5.0
## (5, 3) 4.0
## (4, 4) 1.0
## (2, 5) 6.0
Here, y is an object of class scipy.sparse.csc_matrix. You should not need to use the toarray method to coerce y to an array with dense storage. scipy.sparse implements all of the matrix operations that I can imagine needing. For example, here are the row and column sums of y:
y.sum(1) # in R: as.matrix(rowSums(x))
## matrix([[5.],
## [3.],
## [6.],
## [2.],
## [1.],
## [4.]])
y.sum(0) # in R: t(as.matrix(colSums(x)))
## matrix([[3., 2., 5., 4., 1., 6.]])
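If you ultimately need a pandas object, pandas can wrap a scipy.sparse matrix without densifying it. A minimal sketch, assuming the y built above and pandas >= 0.25:
import pandas as pd
# Wrap the CSC matrix in a DataFrame backed by pandas' sparse dtype,
# so the zeros are never materialized in memory.
df = pd.DataFrame.sparse.from_spmatrix(y)
df.sparse.density  # fraction of explicitly stored values
df.sparse.to_dense() is available if a dense frame is ever needed; for large matrices, the sparse-backed frame keeps memory use proportional to the number of stored elements.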
Upvotes: 2
Reputation: 76575
Maybe something like the following function can be of help.
The function reshapes the sparse matrix to long format, eliminating the zeros from it. This reduces both the final data.frame size and the file size on disk.
library(Matrix)
library(feather)
dgcMatrix_to_long_df <- function(x) {
  res <- NULL
  if(nrow(x) > 0L) {
    for(i in 1:nrow(x)){
      # densify one row at a time and pivot it to long format
      d <- as.matrix(x[i, , drop = FALSE])
      d <- as.data.frame(d)
      d$row <- i
      d <- tidyr::pivot_longer(d, cols = -row, names_to = "col")
      # keep only the non-zero entries
      d <- d[d$value != 0, ]
      res <- rbind(res, d)
    }
  }
  res
}
y <- dgcMatrix_to_long_df(data_year)
head(y)
# A tibble: 6 x 3
# row col value
# <int> <chr> <dbl>
#1 1 Col_0103 51
#2 1 Col_0149 6
#3 1 Col_0188 5
#4 1 Col_0238 89
#5 1 Col_0545 14
#6 1 Col_0547 58
path <- "my_data.feather"
write_feather(y, path)
z <- read_feather(path)
identical(y, z)
#[1] TRUE
# The file size is 232 KB though the initial matrix
# had 1 million elements stored as doubles,
# for a total of 8 MB, a saving of around 97%
file.size(path)/1024
#[1] 232.0234
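On the Python side, a rough sketch of reading the long-format file back and rebuilding a sparse matrix; it assumes 1-based row indices and the Col_#### naming of the test data at the end of this answer, so adjust the parsing of col to your real column names:
import pandas as pd
from scipy import sparse
# Rebuild a sparse matrix from the (row, col, value) triplets.
df = pd.read_feather("my_data.feather")
rows = df["row"].to_numpy() - 1  # R indices are 1-based
cols = df["col"].str.replace("Col_", "", regex=False).astype(int).to_numpy() - 1
# Pass shape= explicitly if trailing all-zero rows/columns matter.
mat = sparse.csc_matrix((df["value"].to_numpy(), (rows, cols)))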
The following function is much faster.
dgcMatrix_to_long_df2 <- function(x) {
  res <- NULL
  if(nrow(x) > 0L) {
    for(i in 1:nrow(x)){
      d <- as.matrix(x[i, , drop = FALSE])
      # locate the non-zero entries and keep their (row, col) indices
      inx <- which(d != 0, arr.ind = TRUE)
      d <- cbind(inx, value = c(d[d != 0]))
      d[, "row"] <- i
      res <- rbind(res, d)
    }
  }
  as.data.frame(res)
}
system.time(y <- dgcMatrix_to_long_df(data_year))
# user system elapsed
# 7.89 0.04 7.92
system.time(y <- dgcMatrix_to_long_df2(data_year))
# user system elapsed
# 0.14 0.00 0.14
The test data used above was created as follows:
set.seed(2022)
n <- 1e3
x <- rep(0L, n*n)
inx <- sample(c(FALSE, TRUE), n*n, replace = TRUE, prob = c(0.99, 0.01))
x[inx] <- sample(100, sum(inx), replace = TRUE)
data_year <- Matrix(x, n, n, dimnames = list(NULL, sprintf("Col_%04d", 1:n)))
Upvotes: 3