Patusz

Reputation: 29

Subsetting a matrix that includes NAs

I have a matrix like so:

     a    b    c    d
[1]  as   ac   ad   ae
[2]  bd   bf   bg   bh
[3]  NA   cf   cd   ce
[4]  NA   NA   dr   dy
[5]  NA   NA   NA   ej 

I would like to subset each column separately, keeping 50% of its non-NA observations, and collect the results in a matrix or list. My output should look like this:

     a    b    c    d
[1]  as   ac   ad   ae
[2]  NA   bf   bg   bh
[3]  NA   NA   NA   ce

So far I have used this code for separate columns without NAs:

mv.s <- subset(mv, mv <= quantile(mv, 0.5))    

Now I was thinking of using something like:

for (i in 1:15) {
mv.s[[i]] <- subset(mv[[i]], mv <= quantile(mv, 0.5))
}

However, when I do this I get the error:

Error in quantile.default(mv, 0.5) : missing values and NaN's not allowed if 'na.rm' is FALSE

When I try this code:

for (i in 1:15) {
mv.s[[i]] <- subset(mv[[i]], mv <= quantile(mv[[i]], 0.5))
}

I get

Error in (1 - h) * qs[i] : non-numeric argument to binary operator
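Both errors appear to come from the quantile() call. The first is raised when NA values reach quantile() without na.rm = TRUE. The second most likely occurs because quantile() is being given character strings such as those in the example matrix, where it needs numeric input. A minimal sketch of the per-column quantile approach, assuming the real data is numeric: the values below are invented purely for illustration, and lower_half() is a hypothetical helper name.

```r
# Hypothetical numeric matrix with the same NA pattern as the example
# (the numbers are made up for illustration only).
mv <- cbind(
  a = c(1, 2, NA, NA, NA),
  b = c(3, 4, 5, NA, NA),
  c = c(6, 7, 8, 9, NA),
  d = c(10, 11, 12, 13, 14)
)

# Hypothetical helper: keep the non-missing values at or below the
# column's own median, ignoring NAs when computing the cutoff.
lower_half <- function(x) {
  cutoff <- quantile(x, 0.5, na.rm = TRUE)
  x[!is.na(x) & x <= cutoff]
}

# Work column by column; a list is used because the columns keep
# different numbers of values.
mv.s <- lapply(seq_len(ncol(mv)), function(j) lower_half(mv[, j]))
names(mv.s) <- colnames(mv)
```

Indexing with mv[, j] selects a whole column, whereas mv[[i]] on a matrix returns a single element.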

Any help would be appreciated.

Upvotes: 0

Views: 2163

Answers (2)

elevendollar

Reputation: 1204

Without any package, using just the apply function, you could do the following:

apply(mat, 2, FUN = function(x){ sample(x, ceiling(length(x)/2), replace = FALSE)})

That takes a random sample of your observations per column without replacement and assumes that your matrix is called mat.

If you call set.seed(1) first to make the random sample reproducible, the result will look like this:

     [,1] [,2] [,3] [,4]
[1,] "bd" NA   NA   "ae"
[2,] NA   "ac" "cd" "ej"
[3,] NA   "cf" "bg" "dy"
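The random sample above draws from all five rows, NA entries included, so it will not generally reproduce the output shown in the question. If the goal is instead the first half of the non-missing values in each column, padded with NA, a deterministic variant could look like the sketch below. It assumes the example matrix is a character matrix built as shown, and first_half() is a hypothetical helper name.

```r
# Rebuild the example matrix from the question (assumed character data).
mat <- cbind(
  a = c("as", "bd", NA, NA, NA),
  b = c("ac", "bf", "cf", NA, NA),
  c = c("ad", "bg", "cd", "dr", NA),
  d = c("ae", "bh", "ce", "dy", "ej")
)

# Hypothetical helper: drop NAs, then keep the first half (rounded up)
# of what remains.
first_half <- function(x) {
  obs <- x[!is.na(x)]
  head(obs, ceiling(length(obs) / 2))
}

# lapply() always returns a list, even if all columns keep the same count.
halves <- lapply(seq_len(ncol(mat)), function(j) first_half(mat[, j]))
names(halves) <- colnames(mat)

# Pad each column with NA up to the longest one and bind into a matrix.
n_rows <- max(lengths(halves))
result <- sapply(halves, function(v) c(v, rep(NA, n_rows - length(v))))
```

Here halves is the list form and result is the padded matrix form, so either output shape mentioned in the question is available.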

Upvotes: 2

Raphael K

Reputation: 2353

The sample_frac() function in dplyr sounds like it fits your needs.

install.packages('dplyr')
library(dplyr)

subset_matrix <- apply(mv, 2, function(x) sample_frac(x, .5, replace = F))

You can specify which fraction of rows you want sampled in sample_frac(). Using apply() column-wise will give you that fraction of observations for each column.

I did not test this because you didn't provide a sample of your data, but it looks like it should work.
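As far as I know, sample_frac() is written for data frames, so it may not accept the plain column vectors that apply() passes to it. A base R sketch of the same idea, sampling a fraction of each column's non-missing values, is given below. It assumes mv is the character matrix from the question; sample_half() and its frac argument are hypothetical names introduced here.

```r
set.seed(1)  # only to make the random draw reproducible

# The example matrix from the question (assumed character data).
mv <- cbind(
  a = c("as", "bd", NA, NA, NA),
  b = c("ac", "bf", "cf", NA, NA),
  c = c("ad", "bg", "cd", "dr", NA),
  d = c("ae", "bh", "ce", "dy", "ej")
)

# Hypothetical helper: sample a fraction of the non-missing values,
# without replacement. Sampling indices with sample.int() avoids the
# special behaviour of sample() on a length-one numeric vector.
sample_half <- function(x, frac = 0.5) {
  obs <- x[!is.na(x)]
  n <- ceiling(length(obs) * frac)
  obs[sample.int(length(obs), n)]
}

# One sampled vector per column, returned as a named list.
sampled <- lapply(seq_len(ncol(mv)), function(j) sample_half(mv[, j]))
names(sampled) <- colnames(mv)
```

Which values end up in sampled depends on the seed, but each element holds half of that column's non-missing values, rounded up.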

Upvotes: 1
