How can I divide a dataset based on percentage?

Question

I have a dataset like this:

ID      var value
9442000 a   2.01
9442000 v   2.2
9442000 h   5.3
9442000 f   0.2
9442000 s   0.55
9442000 t   0.6
952001  d   0.22
952001  g   0.44
952001  g   0.44
952001  h   0.77
652115  a   4.66
652115  d   1.55
652115  s   2.55
652115  s   2.55

I want to split this into two data frames, one for calibration (75%) and one for validation (25%). Doing this for the dataset as a whole is easy, but I want to do it per ID, so that 75% of the rows for EACH ID go to calibration. For example, for ID 9442000, I want to put any four (randomly chosen) events into the calibration data frame and the other two into the validation data frame.

Expected output:

*Calibration*
ID      var value
9442000 a   2.01
9442000 v   2.2
9442000 h   5.3
9442000 f   0.2
952001  d   0.22
952001  g   0.44
952001  g   0.44
652115  a   4.66
652115  d   1.55
652115  s   2.55

And

*Validation*
ID      var value
9442000 s   0.55
9442000 t   0.6
952001  h   0.77
652115  s   2.55

Neal Fultz · Accepted Answer

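First, define a variable that marks which group each row goes in, then use split. Here the first of every four rows within each ID (about 25%) is flagged 1 for validation, and the rest are flagged 0 for calibration: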

> df$test <- ave(df$ID, df$ID, FUN = function(X) seq_along(X) %% 4 == 1)
>
> split(df, df$test)
$`0`
        ID var value test
2  9442000   v  2.20    0
3  9442000   h  5.30    0
4  9442000   f  0.20    0
6  9442000   t  0.60    0
8   952001   g  0.44    0
9   952001   g  0.44    0
10  952001   h  0.77    0
12  652115   d  1.55    0
13  652115   s  2.55    0
14  652115   s  2.55    0

$`1`
        ID var value test
1  9442000   a  2.01    1
5  9442000   s  0.55    1
7   952001   d  0.22    1
11  652115   a  4.66    1
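
The split above picks rows by position within each ID, so it is deterministic rather than random. If a random 75/25 split per ID is wanted, as the question asks, something like the following base R sketch may work. It assumes the data frame is named df as in the answer. The column name validation, the seed value, and the use of round() to turn 25% into a row count are illustrative choices, not part of the original answer.

```r
# Rebuild the example data from the question.
df <- data.frame(
  ID    = c(rep(9442000, 6), rep(952001, 4), rep(652115, 4)),
  var   = c("a", "v", "h", "f", "s", "t", "d", "g", "g", "h", "a", "d", "s", "s"),
  value = c(2.01, 2.2, 5.3, 0.2, 0.55, 0.6, 0.22, 0.44, 0.44, 0.77,
            4.66, 1.55, 2.55, 2.55)
)

set.seed(42)  # arbitrary seed (assumption), only to make the split reproducible

# Within each ID, assign every row a random rank and flag the rows whose rank
# falls in the first 25% (rounded) as validation. ave() returns 0/1 here, so
# compare with 1 to get a logical column.
df$validation <- ave(
  seq_len(nrow(df)), df$ID,
  FUN = function(i) sample(seq_along(i)) <= round(0.25 * length(i))
) == 1

calibration <- df[!df$validation, c("ID", "var", "value")]
validation  <- df[df$validation,  c("ID", "var", "value")]
```

With the example data this yields 10 calibration rows and 4 validation rows (2, 1 and 1 per ID), though which rows land in each set depends on the seed. Note that round() sends very small groups entirely to calibration (for an ID with two rows, round(0.5) is 0 in R); ceiling() could be used instead if every ID must appear in validation.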
