Parseltongue

Reputation: 11707

What is the best way to generate a random dataset from an existing dataset?

Are there any packages in R that can generate a random dataset given a pre-existing template dataset?

For example, let's say I have the iris dataset:

data(iris)
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

I want some function random_df(iris) that generates a data frame with the same columns as iris but with random data, preferably random data that preserves certain statistical properties of the original (e.g., the mean and standard deviation of the numeric variables).

What is the easiest way to do this?


[Comment from question author moved here. --Editor's note]

I don't want to sample random rows from an existing dataset. I want to generate genuinely random data with all the same columns (and types) as an existing dataset. Ideally, there would be some way to preserve the statistical properties of the numeric variables, but that is not required.

Upvotes: 2

Views: 717

Answers (1)

Maurits Evers

Reputation: 50738

How about this for a start:

Define a function that simulates data from df by

  1. drawing samples from a normal distribution for numeric columns in df, with the same mean and sd as in the original data column, and
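  2. uniformly drawing samples from the levels of factor columns.

generate_data <- function(df, nrow = 10) {
    as.data.frame(lapply(df, function(x) {
        if (is.numeric(x)) {
            # Normal draws with the same mean and sd as the original column
            rnorm(nrow, mean = mean(x), sd = sd(x))
        } else if (is.factor(x)) {
            # Uniform draws from the levels; factor() keeps the full level set
            # and the factor type regardless of the stringsAsFactors default
            factor(sample(levels(x), nrow, replace = TRUE), levels = levels(x))
        }
        # Columns of any other type return NULL and are dropped
    }))
}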

Then, for example, if we take iris, we get

set.seed(2019)
df <- generate_data(iris)
str(df)
#'data.frame':  10 obs. of  5 variables:
# $ Sepal.Length: num  6.45 5.42 4.49 6.6 4.79 ...
# $ Sepal.Width : num  2.95 3.76 2.57 3.16 3.2 ...
# $ Petal.Length: num  4.26 5.47 5.29 6.19 2.33 ...
# $ Petal.Width : num  0.487 1.68 1.779 0.809 1.963 ...
# $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 3 2 1 2 3 2 1 1 2 3
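
Note that generate_data samples every column independently, so column means and standard deviations are matched on average, but correlations between columns are not. If that matters, one possible alternative (not part of the approach above) is to draw the numeric columns jointly from a multivariate normal distribution. The following is a minimal sketch, assuming the MASS package is available and that a multivariate normal is an acceptable approximation; generate_correlated is a hypothetical helper name.

```r
library(MASS)  # recommended package, shipped with standard R installations

# Hypothetical helper: simulate the numeric columns of df jointly, using the
# column means and covariance matrix of the original data.
generate_correlated <- function(df, n = 10) {
    num <- df[vapply(df, is.numeric, logical(1))]
    sim <- mvrnorm(n, mu = colMeans(num), Sigma = cov(num))
    # mvrnorm() returns a plain vector when n = 1, so rebuild the matrix shape
    sim <- matrix(sim, nrow = n, dimnames = list(NULL, names(num)))
    as.data.frame(sim)
}

set.seed(2019)
sim <- generate_correlated(iris, n = 1000)
round(cor(sim), 2)         # simulated correlations
round(cor(iris[1:4]), 2)   # original correlations, for comparison
```

Non-numeric columns such as Species are ignored in this sketch and would have to be sampled separately.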

It should be fairly straightforward to extend the generate_data function to account for other column types.
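
As a rough illustration of such an extension, here is a sketch that also handles integer, logical and character columns. The name generate_data_ext and the sampling rules for the extra types (rounded normal draws for integers, Bernoulli draws for logicals, uniform draws from the observed values for everything else) are assumptions made for this example, not the only reasonable choices.

```r
# Hypothetical extension of generate_data(); the rules for the non-numeric,
# non-factor types are illustrative assumptions.
generate_data_ext <- function(df, nrow = 10) {
    as.data.frame(lapply(df, function(x) {
        if (is.factor(x)) {
            # Uniform draws from the levels, keeping level set and ordering
            factor(sample(levels(x), nrow, replace = TRUE),
                   levels = levels(x), ordered = is.ordered(x))
        } else if (is.integer(x)) {
            # Normal draws rounded back to whole numbers
            as.integer(round(rnorm(nrow, mean(x, na.rm = TRUE), sd(x, na.rm = TRUE))))
        } else if (is.numeric(x)) {
            rnorm(nrow, mean(x, na.rm = TRUE), sd(x, na.rm = TRUE))
        } else if (is.logical(x)) {
            # TRUE with the same probability as the observed share of TRUEs
            runif(nrow) < mean(x, na.rm = TRUE)
        } else {
            # Fallback (e.g. character): uniform draws from the observed values
            u <- unique(x)
            u[sample.int(length(u), nrow, replace = TRUE)]
        }
    }), stringsAsFactors = FALSE)
}

# Small made-up data frame, used only to exercise each branch
toy <- data.frame(id = 1:5,
                  score = c(1.5, 2.0, 2.5, 3.0, 3.5),
                  flag = c(TRUE, FALSE, TRUE, TRUE, FALSE),
                  name = c("a", "b", "c", "d", "e"),
                  stringsAsFactors = FALSE)
set.seed(2019)
str(generate_data_ext(toy, nrow = 3))
```

The integer check comes before the general numeric check because is.numeric() is also TRUE for integer vectors.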

Upvotes: 1
