Reputation: 3824
Let's say we have the following data frame in R:
df <- data.frame(sample=rnorm(1,0,1),params=I(list(list(mean=0,sd=1,dist="Normal"))))
df <- rbind(df,data.frame(sample=rgamma(1,5,5),params=I(list(list(shape=5,rate=5,dist="Gamma")))))
df <- rbind(df,data.frame(sample=rbinom(1,7,0.7),params=I(list(list(size=7,prob=0.7,dist="Binomial")))))
df <- rbind(df,data.frame(sample=rnorm(1,2,3),params=I(list(list(mean=2,sd=3,dist="Normal")))))
df <- rbind(df,data.frame(sample=rt(1,3),params=I(list(list(df=3,dist="Student-T")))))
The first column contains a random draw from a probability distribution, and the second column stores a list with that distribution's parameters and name.
The data frame df looks like:
sample params
1 0.85102972 0, 1, Normal
2 0.67313218 5, 5, Gamma
3 3.00000000 7, 0.7, ....
4 0.08488487 2, 3, Normal
5 0.95025523 3, Student-T
Q1: How can I get the distribution names for all records? df$params$dist does not work. For a single record it is easy, for example the third one: df$params[[3]]$dist
Q2: Is there an alternative way of storing data like this, something like a multi-dimensional data frame? I do not want to add a column for each parameter, because that would litter the data frame with missing values.
Upvotes: 3
Views: 7008
Reputation: 1123
If you absolutely must store this information in a data frame, one way of doing so springs to mind. You're currently using a params column in your data frame to store the parameters associated with the distributions. Perhaps a better way of doing this would be to (i) identify the maximum number of parameters that you'll need for any distribution, (ii) store the distribution names in a field called df$distribution, and (iii) store the parameters in dedicated parameter columns, the meaning of which will have to be decided upon based on the type of distribution.
For instance, any row with df$distribution = 'Normal' should have df$param1 = the mean and df$param2 = the standard deviation. A row with df$distribution = 'Student' should have df$param1 = the degrees of freedom and df$param2 = NA. Something like the following:
dg <- data.frame(sample=rnorm(1, 0, 1), distribution='Normal',
param1=0, param2=1)
dg <- rbind(dg, data.frame(sample=rgamma(1, 5, 5),
distribution='Gamma', param1=5, param2=5))
dg <- rbind(dg, data.frame(sample=rt(1, 3), distribution='Student',
param1=3, param2=NA))
It's ugly, but it will give you what you want. And don't worry about the missing values; missing values are a fact of life when dealing with non-trivial data frames. They can be dealt with easily in R by appropriate use of things like na.rm and complete.cases().
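As a minimal sketch of the missing-value handling mentioned above, assuming dg is built as in this answer (the seed and the names two_param and mean_param2 are illustrative, not part of the original):

```r
# Rebuild the wide-format data frame from this answer.
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
dg <- data.frame(sample = rnorm(1, 0, 1), distribution = "Normal",
                 param1 = 0, param2 = 1, stringsAsFactors = FALSE)
dg <- rbind(dg, data.frame(sample = rgamma(1, 5, 5), distribution = "Gamma",
                           param1 = 5, param2 = 5, stringsAsFactors = FALSE))
dg <- rbind(dg, data.frame(sample = rt(1, 3), distribution = "Student",
                           param1 = 3, param2 = NA, stringsAsFactors = FALSE))

# Keep only rows with no missing values (this drops the Student row).
two_param <- dg[complete.cases(dg), ]

# Summaries can skip NA values instead of returning NA.
mean_param2 <- mean(dg$param2, na.rm = TRUE)
```

Without na.rm = TRUE, mean(dg$param2) would return NA because of the Student row.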
Upvotes: 2
Reputation: 226097
Based on the data frame you have above,
sapply(df$params,"[[","dist")
(or lapply if you prefer) would work.
I would probably put at least the names of the distributions in their own column, even if you want the parameters to be in variable-length lists.
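A rough sketch of that suggestion, assuming a list column like the one in the question; the sample values here are fixed placeholders rather than random draws, and the column name dist and the variable normals are my own choices:

```r
# A small data frame with a list column of parameters, as in the question.
df <- data.frame(sample = c(0.85, 0.67, 3))  # placeholder values
df$params <- I(list(list(mean = 0, sd = 1, dist = "Normal"),
                    list(shape = 5, rate = 5, dist = "Gamma"),
                    list(size = 7, prob = 0.7, dist = "Binomial")))

# Pull the distribution name out of each parameter list into its own column.
# vapply is used here as a type-checked variant of sapply(df$params, "[[", "dist").
df$dist <- vapply(df$params, function(p) p$dist, character(1))

# The name is now an ordinary column, so normal subsetting works.
normals <- df[df$dist == "Normal", ]
```

The parameters stay in their variable-length lists, but filtering or grouping by distribution no longer needs to reach into the list column.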
Upvotes: 1
Reputation: 173527
It's probably more natural to store information like this in a pure list structure than in a data frame:
distList <- list(normal = list(sample=rnorm(1,0,1),params=list(mean=0,sd=1,dist="Normal")),
gamma = list(sample=rgamma(1,5,5),params=list(shape=5,rate=5,dist="Gamma")),
binom = list(sample=rbinom(1,7,0.7),params=list(size=7,prob=0.7,dist="Binomial")),
normal2 = list(sample=rnorm(1,2,3),params=list(mean=2,sd=3,dist="Normal")),
tdist = list(sample=rt(1,3),params=list(df=3,dist="Student-T")))
Then, if you want to extract just the distribution name from each element, you can use sapply to loop over the list and pull out that piece:
sapply(distList,function(x) x[[2]]$dist)
normal gamma binom normal2 tdist
"Normal" "Gamma" "Binomial" "Normal" "Student-T"
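If the positional index x[[2]] feels fragile, a variant that looks the element up by name might be sketched as follows. It assumes the same list layout as above, shortened to two entries, and the names dist_names and summary_df are only illustrative:

```r
# A shortened version of the nested list from this answer.
distList <- list(
  normal = list(sample = rnorm(1, 0, 1),
                params = list(mean = 0, sd = 1, dist = "Normal")),
  tdist  = list(sample = rt(1, 3),
                params = list(df = 3, dist = "Student-T"))
)

# Extract by name rather than by position; vapply also checks the result type.
dist_names <- vapply(distList, function(x) x$params$dist, character(1))

# The fixed-length pieces can still be collected into a flat data frame.
summary_df <- data.frame(
  name   = names(distList),
  sample = vapply(distList, function(x) x$sample, numeric(1)),
  dist   = dist_names,
  row.names = NULL,
  stringsAsFactors = FALSE
)
```

Looking up x$params$dist by name keeps working if the order of sample and params inside each element ever changes.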
Upvotes: 2