nsa

Reputation: 565

Plotting a matrix "by parts" in R?

I have a 50k by 50k square matrix saved to disk in a text file and I would like to produce a simple histogram to see the distribution of the values in the matrix.

Obviously, when I try to load the matrix in R using read.table(), I get a memory error because the matrix is too big. Is there any way I could load smaller submatrices one at a time, but still produce a histogram that considers all the values of the original matrix? I can indeed load smaller submatrices, but each time I just overwrite the histogram from the previous submatrix with the distribution of the new one.

Upvotes: 4

Views: 103

Answers (1)

MDEWITT

Reputation: 2368

Here's an approach. I don't have all the details because you did not provide sample data or the expected output, but one way to do this is with the read_csv_chunked function in the readr package. First, write your summarisation function, then apply it to each chunk. See below for a full reprex.


# Call the Required Libraries
library(dplyr)
library(ggplot2)
library(readr)

# First Generate Some Fake Data
temp <- tempfile(fileext = ".csv")

fake_dat <- as.data.frame(matrix(rnorm(1000*100), ncol = 100))
write_csv(fake_dat, temp)



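# Now write a summarisation function
# This will be applied to each chunk that is read into
# memory. As written, it bins only column V1; other columns
# would need the same treatment to cover the whole matrix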
summarise_for_hist <- function(x, pos){
  x %>% 
    mutate(added_bin = cut(V1, breaks = -6:6)) %>% 
    count(added_bin)
}

# Note that I manually set the cutpoints or "breaks"
# argument. You would need to refine this based on your
# data and subject matter expertise

# Read the file in chunks, applying the function to each chunk

small_read <- read_csv_chunked(temp, # data
                               DataFrameCallback$new(summarise_for_hist),
                               chunk_size = 200 # number of lines to read
                               )

Now that we have summarised our data, we can combine and plot it.


# Generate our histogram by combining all of the results
# and plotting

small_read %>% 
  group_by(added_bin) %>% 
  summarise(total = sum(n)) %>% 
  ggplot(aes(added_bin, total))+
  geom_col()

This will yield the following:

[Image: bar chart of the total count in each bin]
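
The summariser above bins only column `V1`, whereas the question asks about every value in the matrix. Below is a minimal base R sketch of the same idea, assuming the file is whitespace-delimited, purely numeric, and has no header row: read a few lines at a time, bin them against fixed breaks, and add the counts to a running total. The function name `chunked_hist_counts` and its arguments are illustrative and not part of any package.

```r
# Hypothetical helper: accumulate histogram counts over a large text matrix
# without ever holding the whole matrix in memory.
# Assumes whitespace-delimited numeric values and no header row.
chunked_hist_counts <- function(path, breaks, chunk_lines = 100) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # Bin labels such as "(-6,-5]", taken from cut() so they match the bins below
  labels <- levels(cut(breaks[-1], breaks = breaks))

  # Doubles rather than integers: 50k x 50k values can exceed the integer range
  counts <- numeric(length(labels))

  repeat {
    lines <- readLines(con, n = chunk_lines)
    if (length(lines) == 0) break

    # Flatten every value in this block of rows into one numeric vector
    values <- scan(text = lines, quiet = TRUE)

    # Values outside the range of `breaks` become NA and are dropped by table()
    counts <- counts + as.vector(table(cut(values, breaks = breaks)))
  }

  names(counts) <- labels
  counts
}
```

With a hypothetical file `matrix.txt`, `barplot(chunked_hist_counts("matrix.txt", breaks = -6:6))` would draw the combined histogram. As in the answer above, the breaks are fixed up front so that counts from different chunks can be added, and they would need to be chosen to span the data.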

Upvotes: 3
