Reputation: 837

Random sampling only a subset of data in R

I have a dataset (N = 2794) from which I want to pick a random subset of rows, randomly reallocate their class, and keep the result in the same data frame.

Example

| Index | B | C | Class|
| 1     | 3 | 4 | Dog  |
| 2     | 1 | 9 | Cat  |
| 3     | 9 | 1 | Dog  |
| 4     | 1 | 1 | Cat  |

From the above example, I want to randomly pick N observations from the column 'Class' and shuffle them, so that I get something like this:

| Index | B | C | Class|
| 1     | 3 | 4 | Cat  | Re-sampled 
| 2     | 1 | 9 | Dog  | Re-sampled 
| 3     | 9 | 1 | Dog  |
| 4     | 1 | 1 | Dog  | Re-sampled

This code randomly extracts rows and resamples them, but I don't want to extract the rows. I want to keep them in the data frame.

 sample(Class[sample(nrow(Class),N),])

Upvotes: 0

Answers (4)

Mark

Reputation: 4537

What you want is to replace some classes in place, while leaving the others untouched.

So, if we start with a data frame, df

set.seed(100)
df = data.frame(index = 1:100,
                B = sample(1:10, 100, replace = TRUE),
                C = sample(1:10, 100, replace = TRUE),
                Class = sample(c('Cat', 'Dog', 'Bunny'), 100, replace = TRUE))

And you want to update 5 random rows, then we need to pick which rows to update and which new classes to put in those rows. Sampling from unique(df$Class) gives every class the same chance, regardless of how often it currently occurs. You could adjust this with the prob argument of sample(), or drop unique() to weight the classes by their current occurrence.

n_rows = 5
rows_to_update = sample(1:100, n_rows, replace = FALSE)
new_classes = sample(unique(df$Class), n_rows, replace = TRUE)
rows_to_update
#> [1] 85 65 94 60 48
new_classes
#> [1] "Bunny" "Dog"   "Dog"   "Dog"   "Bunny"

We can inspect what the original data looked like

df[rows_to_update,]
#>    index B  C Class
#> 85    85 1  2   Dog
#> 65    65 5  1 Bunny
#> 94    94 5 10   Dog
#> 60    60 3  7 Bunny
#> 48    48 9  1   Cat

We can update this in place with a reference to the column and the rows to update.

df$Class[rows_to_update] = new_classes
df[rows_to_update,]
#>    index B  C Class
#> 85    85 1  2 Bunny
#> 65    65 5  1   Dog
#> 94    94 5 10   Dog
#> 60    60 3  7   Dog
#> 48    48 9  1 Bunny
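
The weighting option mentioned earlier can be sketched as follows. This is a minimal example, assuming the same simulated df as above; class_freq is a name introduced here for illustration, and the relative frequencies are passed to sample() through its prob argument.

```r
set.seed(100)
df <- data.frame(index = 1:100,
                 B = sample(1:10, 100, replace = TRUE),
                 C = sample(1:10, 100, replace = TRUE),
                 Class = sample(c("Cat", "Dog", "Bunny"), 100, replace = TRUE))

# Relative frequency of each class in the current data
class_freq <- prop.table(table(df$Class))

n_rows <- 5
rows_to_update <- sample(nrow(df), n_rows)

# Draw new classes with probability proportional to current occurrence
new_classes <- sample(names(class_freq), n_rows, replace = TRUE,
                      prob = as.numeric(class_freq))

# Overwrite only the selected rows; all other rows stay as they were
df$Class[rows_to_update] <- new_classes
```

Drawing directly from the column with sample(df$Class, n_rows, replace = TRUE) should give the same occurrence-based weighting without computing the frequencies explicitly.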

Upvotes: 0

D Pinto

Reputation: 901

Assuming Class is the name of your data frame, you could do this:

library(dplyr)

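bind_rows(
  # original rows, unchanged
  Class %>% 
    mutate(origin = 'not_sampled'),
  # 100 rows drawn with replacement; slice_sample() samples rows,
  # whereas base sample() on a data frame would sample columns
  Class %>% 
    slice_sample(n = 100, replace = TRUE) %>% 
    mutate(origin = 'sampled'))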

This samples 100 observations from the original data frame and stacks them below it. The added origin column records whether each row was sampled or was present in the data frame from the beginning.

Upvotes: 0

cirofdo

Reputation: 1074

I simulated the data frame and did an example:

df <- data.frame(
  ID=1:4,
  Class=c('Dog', 'Cat', 'Dog', 'Cat')
)

N <- 2
sample_ids <- sample(nrow(df), N)

df$Class[sample_ids] <- sample(df$Class, length(sample_ids))
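
To keep track of which rows were touched, as in the "Re-sampled" annotation in the question, one option is a logical flag column. This is only a sketch on the same toy data; the column name resampled and the seed are assumptions added for illustration.

```r
df <- data.frame(
  ID = 1:4,
  Class = c("Dog", "Cat", "Dog", "Cat")
)

set.seed(42)  # only for reproducibility of this sketch
N <- 2
sample_ids <- sample(nrow(df), N)

# Flag the selected rows before overwriting their class
df$resampled <- seq_len(nrow(df)) %in% sample_ids

# New classes are drawn from the whole column, so class counts may change
df$Class[sample_ids] <- sample(df$Class, length(sample_ids))
```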

Upvotes: 1

user387832

Reputation: 503

Suppose df is your data frame:

df <- data.frame(index=1:4, B=c(3,1,9,1), C=c(4,9,1,1), Class=c("Dog", "Cat", "Dog", "Cat"))

Would this do what you want?

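N <- 3  # number of rows to reshuffle (example value)
dfSamp <- sample(seq_len(nrow(df)), N)        # row indices picked at random
df$Class[dfSamp] <- sample(df$Class[dfSamp])  # permute the classes within those rows only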
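
The same idea can be wrapped in a small helper. This is a sketch, not a definitive implementation: reshuffle_subset is a hypothetical name, and it permutes row indices rather than values to sidestep sample()'s special handling of a single numeric value.

```r
# Hypothetical helper: permute the values of one column within n random rows
reshuffle_subset <- function(data, column, n) {
  rows <- sample(nrow(data), n)
  # Permute the selected row indices, then copy values across
  data[[column]][rows] <- data[[column]][rows[sample.int(n)]]
  data
}

df <- data.frame(index = 1:4, B = c(3, 1, 9, 1), C = c(4, 9, 1, 1),
                 Class = c("Dog", "Cat", "Dog", "Cat"))

set.seed(1)  # only for reproducibility of this sketch
shuffled <- reshuffle_subset(df, "Class", 3)

# Values are only permuted, so the class counts should match the original
table(shuffled$Class)
```

Because the classes are permuted within the chosen rows rather than redrawn, the overall class distribution stays the same, and a selected row can end up with its original class.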

Upvotes: 2
