Reputation: 565
I have a 50k by 50k square matrix saved to disk in a text file and I would like to produce a simple histogram to see the distribution of the values in the matrix.
Obviously, when I try to load the matrix in R using read.table(), I hit a memory error because the matrix is too big. Is there any way I could load smaller submatrices one at a time, but still produce a histogram that considers all the values of the original matrix? I can indeed load smaller submatrices, but then I just overwrite the histogram I had for the previous submatrix with the distribution of the new one.
Upvotes: 4
Views: 103
Reputation: 2368
Here's an approach. I don't have all the details because you did not provide sample data or the expected output, but one way to do this is with the read_csv_chunked function in the readr package. First, write your summarisation function, then apply it to each chunk. See below for a full reprex.
# Call the Required Libraries
library(dplyr)
library(ggplot2)
library(readr)
# First Generate Some Fake Data
temp <- tempfile(fileext = ".csv")
fake_dat <- as.data.frame(matrix(rnorm(1000*100), ncol = 100))
write_csv(fake_dat, temp)
# Now write a summarisation function
# This will be applied to each chunk that is read into
# memory
summarise_for_hist <- function(x, pos){
  x %>%
    mutate(added_bin = cut(V1, breaks = -6:6)) %>%
    count(added_bin)
}
# Note that I manually set the cutpoints or "breaks"
# argument. You would need to refine this based on your
# data and subject matter expertise
# Apply the summarisation function to each chunk
small_read <- read_csv_chunked(temp, # data
                               DataFrameCallback$new(summarise_for_hist),
                               chunk_size = 200 # number of rows to read per chunk
)
Now that we have summarised our data, we can combine and plot it.
# Generate our histogram by combining all of the results
# and plotting
small_read %>%
  group_by(added_bin) %>%
  summarise(total = sum(n)) %>%
  ggplot(aes(added_bin, total)) +
  geom_col()
This will yield a bar chart of the total count per bin, which is effectively a histogram of all the values that were read.
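Finally, if your real file is a plain text matrix with no header row rather than a CSV, read_delim_chunked from the same package should drop in. The file name and delimiter below are assumptions, so adjust them to match your data:
# Hypothetical file name and delimiter; adjust for your data
big_read <- read_delim_chunked("matrix.txt",
                               DataFrameCallback$new(summarise_all_values),
                               delim = " ",
                               col_names = FALSE, # columns become X1, X2, ...
                               chunk_size = 1000) # tune to available memory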
Upvotes: 3