Reputation: 11707
Are there any packages in R that can generate a random dataset given a pre-existing template dataset?
For example, let's say I have the iris dataset:
> data(iris)
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
I want some function random_df(iris) which will generate a data frame with the same columns as iris but with random data, preferably random data that preserves certain statistical properties of the original (e.g., the mean and standard deviation of the numeric variables).
What is the easiest way to do this?
[Comment from question author moved here. --Editor's note]
I don't want to sample random rows from an existing dataset. I want to generate actually random data with all the same columns (and types) as an existing dataset. Ideally, the statistical properties of the numeric variables would be preserved, but that is not required.
Upvotes: 2
Views: 717
Reputation: 50738
How about this for a start:
Define a function that simulates data from df by
1. drawing samples from a normal distribution for numeric columns in df, with the same mean and sd as in the original data column, and
2. drawing samples uniformly from the levels of factor columns.
generate_data <- function(df, nrow = 10) {
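  as.data.frame(lapply(df, function(x) {
    if (is.numeric(x)) {
      # Normal draws matching the mean and sd of the original column
      rnorm(nrow, mean = mean(x), sd = sd(x))
    } else if (is.factor(x)) {
      # Uniform draws from the levels; factor() keeps the column a factor,
      # since strings are no longer auto-converted in R >= 4.0
      factor(sample(levels(x), nrow, replace = TRUE), levels = levels(x))
    }
    # Other column types are not handled here
  }))
}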
Then, for example, if we take iris, we get
set.seed(2019)
df <- generate_data(iris)
str(df)
#'data.frame': 10 obs. of 5 variables:
# $ Sepal.Length: num 6.45 5.42 4.49 6.6 4.79 ...
# $ Sepal.Width : num 2.95 3.76 2.57 3.16 3.2 ...
# $ Petal.Length: num 4.26 5.47 5.29 6.19 2.33 ...
# $ Petal.Width : num 0.487 1.68 1.779 0.809 1.963 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 3 2 1 2 3 2 1 1 2 3
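As a rough sanity check, you could compare the column means and standard deviations of a larger simulated sample against the original. The sketch below repeats the numeric-column logic inline so that it runs on its own; the names n, num_cols, sim and comparison are just illustrative choices.

```r
set.seed(2019)
n <- 10000  # a larger sample, so the estimates settle near their targets

# Numeric columns of the template data
num_cols <- names(iris)[sapply(iris, is.numeric)]

# Same idea as the numeric branch of generate_data(): independent normal
# draws per column, using the mean and sd of the original column
sim <- as.data.frame(lapply(iris[num_cols], function(x) {
  rnorm(n, mean = mean(x), sd = sd(x))
}))

# Side-by-side comparison of original and simulated summary statistics
comparison <- data.frame(
  mean_original  = sapply(iris[num_cols], mean),
  mean_simulated = sapply(sim, mean),
  sd_original    = sapply(iris[num_cols], sd),
  sd_simulated   = sapply(sim, sd)
)
print(round(comparison, 2))
```

Note that each column is drawn independently, so the per-column mean and sd are approximately preserved, but correlations between columns (which are strong in iris) are not.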
It should be fairly straightforward to extend the generate_data function to account for other column types.
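One possible way to do that is sketched below. The function name generate_data_ext and the rules chosen for integer, logical and character columns are my own assumptions, not the only sensible choices; the template data frame is made up purely for illustration.

```r
# Hypothetical extension; the handling of integer, logical and character
# columns is an assumption made for illustration.
generate_data_ext <- function(df, nrow = 10) {
  as.data.frame(lapply(df, function(x) {
    if (is.factor(x)) {
      # Uniform draws from the levels, keeping the full level set
      factor(sample(levels(x), nrow, replace = TRUE), levels = levels(x))
    } else if (is.logical(x)) {
      # Bernoulli draws using the observed proportion of TRUE
      runif(nrow) < mean(x, na.rm = TRUE)
    } else if (is.integer(x)) {
      # Rounded normal draws so the column stays integer
      as.integer(round(rnorm(nrow, mean(x, na.rm = TRUE), sd(x, na.rm = TRUE))))
    } else if (is.numeric(x)) {
      rnorm(nrow, mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
    } else if (is.character(x)) {
      # Uniform draws from the distinct observed strings
      sample(unique(x), nrow, replace = TRUE)
    } else {
      stop("unsupported column type: ", class(x)[1])
    }
  }), stringsAsFactors = FALSE)
}

# Made-up template with one column of each supported type
template <- data.frame(
  id    = 1:6,
  score = c(2.5, 3.1, 4.8, 5.0, 3.3, 4.1),
  flag  = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  name  = c("a", "b", "c", "a", "b", "c"),
  group = factor(c("x", "y", "x", "y", "x", "y")),
  stringsAsFactors = FALSE
)

set.seed(1)
sim <- generate_data_ext(template, nrow = 20)
str(sim)
```

The factor check comes first and the integer check comes before the general numeric one, so each column falls into the most specific branch. Anything unrecognised (dates, list columns) raises an error rather than being silently skipped.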
Upvotes: 1