Reputation: 837
I have a dataset (N = 2794) from which I want to take a subset, randomly reallocate the class, and put it back into the dataframe.
Example
| Index | B | C | Class|
| 1 | 3 | 4 | Dog |
| 2 | 1 | 9 | Cat |
| 3 | 9 | 1 | Dog |
| 4 | 1 | 1 | Cat |
From the above example, I want to randomly take N observations from column 'Class' and mix them up, so you get something like this:
| Index | B | C | Class|
| 1 | 3 | 4 | Cat | Re-sampled
| 2 | 1 | 9 | Dog | Re-sampled
| 3 | 9 | 1 | Dog |
| 4 | 1 | 1 | Dog | Re-sampled
This code randomly extracts rows and resamples them, but I don't want to extract the rows. I want to keep them in the dataframe.
sample(Class[sample(nrow(Class),N),])
Upvotes: 0
Views: 1239
Reputation: 4537
What you want to do is replace some classes in place, but not others.
So, if we start with a data frame, df
set.seed(100)
df = data.frame(index = 1:100,
B = sample(1:10,100,replace = T),
C = sample(1:10,100,replace = T),
Class = sample(c('Cat','Dog','Bunny'),100,replace = T))
And you want to update 5 random rows, then we need to pick which rows to update and what new classes to put in those rows. By sampling from unique(df$Class) you don't weight the classes by their current occurrence. You could adjust this with the prob argument of sample(), or remove unique() to use occurrence as the weight.
n_rows = 5
rows_to_update = sample(1:100,n_rows,replace = F)
new_classes = sample(unique(df$Class),n_rows,replace = T)
rows_to_update
#> [1] 85 65 94 60 48
new_classes
#> [1] "Bunny" "Dog" "Dog" "Dog" "Bunny"
We can inspect what the original data looked like
df[rows_to_update,]
#> index B C Class
#> 85 85 1 2 Dog
#> 65 65 5 1 Bunny
#> 94 94 5 10 Dog
#> 60 60 3 7 Bunny
#> 48 48 9 1 Cat
We can update this in place with a reference to the column and the rows to update.
df$Class[rows_to_update] = new_classes
df[rows_to_update,]
#> index B C Class
#> 85 85 1 2 Bunny
#> 65 65 5 1 Dog
#> 94 94 5 10 Dog
#> 60 60 3 7 Dog
#> 48 48 9 1 Bunny
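The weighting options mentioned above can be sketched roughly as follows. This is only an illustration: the weights in class_probs are made-up values, and the simplified data frame is an assumption rather than the exact one used in this answer.

```r
set.seed(100)
df <- data.frame(index = 1:100,
                 Class = sample(c("Cat", "Dog", "Bunny"), 100, replace = TRUE),
                 stringsAsFactors = FALSE)
n_rows <- 5
rows_to_update <- sample(nrow(df), n_rows)

# Option 1: every distinct class is equally likely.
uniform_classes <- sample(unique(df$Class), n_rows, replace = TRUE)

# Option 2: drop unique() so classes are drawn in proportion to how
# often they currently occur in the column.
occurrence_classes <- sample(df$Class, n_rows, replace = TRUE)

# Option 3: explicit weights via prob (hypothetical values for illustration).
class_levels <- c("Cat", "Dog", "Bunny")
class_probs  <- c(0.5, 0.3, 0.2)
weighted_classes <- sample(class_levels, n_rows, replace = TRUE,
                           prob = class_probs)

# Write whichever vector you chose back into the selected rows.
df$Class[rows_to_update] <- weighted_classes
```

Only the rows in rows_to_update are touched; all other rows keep their original class.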
Upvotes: 0
Reputation: 901
Assuming Class is how you named your dataframe, you could do this:
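library(dplyr)
bind_rows(
  # original rows, flagged as not sampled
  Class %>%
    mutate(origin = 'not_sampled'),
  # 100 rows drawn with replacement; slice_sample() (dplyr >= 1.0.0) samples
  # rows, whereas base sample() on a data frame would sample columns
  Class %>%
    slice_sample(n = 100, replace = TRUE) %>%
    mutate(origin = 'sampled'))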
This samples 100 observations from the original dataframe and stacks them at the bottom of it. I am also adding a column so that you know whether each observation was sampled or was present in the dataframe from the beginning.
Upvotes: 0
Reputation: 1074
I simulated the data frame and did an example:
df <- data.frame(
ID=1:4,
Class=c('Dog', 'Cat', 'Dog', 'Cat')
)
N <- 2
sample_ids <- sample(nrow(df), N)
df$Class[sample_ids] <- sample(df$Class, length(sample_ids))
Upvotes: 1
Reputation: 503
Suppose df is your data frame:
df <- data.frame(index=1:4, B=c(3,1,9,1), C=c(4,9,1,1), Class=c("Dog", "Cat", "Dog", "Cat"))
Would this do what you want?
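N <- 3  # number of rows to reshuffle (example value; N is not defined above)
dfSamp <- sample(1:nrow(df), N)               # pick N row indices at random
df$Class[dfSamp] <- sample(df$Class[dfSamp])  # permute the labels of those rows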
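One property of this approach may be worth checking: sample(df$Class[dfSamp]) only permutes the labels already present in the chosen rows, so the overall class counts should stay the same. Below is a small sketch of that check, with a flag column similar to the "Re-sampled" marker in the question. The column name resampled, the seed, and N = 3 are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
df <- data.frame(index = 1:4, B = c(3, 1, 9, 1), C = c(4, 9, 1, 1),
                 Class = c("Dog", "Cat", "Dog", "Cat"),
                 stringsAsFactors = FALSE)
N <- 3  # number of rows to reshuffle (example value)

counts_before <- table(df$Class)

dfSamp <- sample(nrow(df), N)                  # rows whose labels get shuffled
df$Class[dfSamp] <- sample(df$Class[dfSamp])   # permute labels within those rows
df$resampled <- seq_len(nrow(df)) %in% dfSamp  # hypothetical flag column

counts_after <- table(df$Class)
stopifnot(identical(counts_before, counts_after))  # class counts are unchanged
```

If the class counts are allowed to change, as in the example output in the question, drawing with replacement (for instance sample(df$Class[dfSamp], replace = TRUE)) is one alternative.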
Upvotes: 2