Reputation: 53
I have a large data set that I am trying to split into three different data frames, which will be used for different stages of testing.
ind <- sample(3, nrow(df1), replace = TRUE, prob = c(0.40, 0.50, 0.10))
df2 <- as.data.frame(df1[ind == 1, 1:27])
df3 <- as.data.frame(df1[ind == 2, 1:27])
df4 <- as.data.frame(df1[ind == 3, 1:27])
However, the first column in df1 is an invoice number, and multiple rows can have the same invoice number, as returns and mistakes are included. I am trying to find a way that will split the data up randomly, but keep all rows with the same invoice number together.
Any suggestions on how I may manage to accomplish this?
Upvotes: 3
Views: 33
Reputation: 96
ind1 <- which(df1[,1] == 1)
ind2 <- which(df1[,1] == 2)
ind3 <- which(df1[,1] == 3)
df2 <- as.data.frame(df1[sample(ind1, length(ind1), replace = TRUE), 1:27])
df3 <- as.data.frame(df1[sample(ind2, length(ind2), replace = TRUE), 1:27])
df4 <- as.data.frame(df1[sample(ind3, length(ind3), replace = TRUE), 1:27])
ind1, ind2, and ind3 identify the rows that contain invoice numbers 1, 2, and 3, respectively. Then, to create the random data frames, a random sample is taken from only the rows that you want. Hope this helps.
Upvotes: 1
Reputation: 37641
Instead of sampling the rows, you could sample the unique invoice numbers and then select the rows with those invoice numbers.
## Some sample data
df1 <- data.frame(invoice = sample(10, 20, replace = TRUE), V = rnorm(20))
## Sample the unique values
ind <- sample(3, length(unique(df1$invoice)), replace = TRUE)
## Select rows by sampled invoice number
df1[df1$invoice %in% unique(df1$invoice)[ind == 1], 1:2]
   invoice           V
2        8 -0.67717939
6        9 -0.89222154
9        8 -0.71756069
14       8 -0.03539096
15       2  0.38453752
16       9 -0.16298835
17       9 -0.30823521
20       2 -0.60198259
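To get all three data frames from the question, the same idea can be extended as in the sketch below. It assumes the invoice number is in a column named invoice (a stand-in for the first column of the real df1) and reuses the 0.40/0.50/0.10 probabilities from the question. The names invoices, grp, and row_grp are made up for illustration, and the seed is there only to make the example reproducible.

```r
## Hypothetical sample data; 'invoice' stands in for the first column of df1
set.seed(42)  # assumption: fixed seed, only so the sketch is reproducible
df1 <- data.frame(invoice = sample(10, 20, replace = TRUE), V = rnorm(20))

## Assign each unique invoice number (not each row) to group 1, 2, or 3
invoices <- unique(df1$invoice)
grp <- sample(3, length(invoices), replace = TRUE, prob = c(0.40, 0.50, 0.10))

## Look up each row's group through its invoice number
row_grp <- grp[match(df1$invoice, invoices)]

## Split the rows; every row of a given invoice lands in the same data frame
df2 <- df1[row_grp == 1, ]
df3 <- df1[row_grp == 2, ]
df4 <- df1[row_grp == 3, ]

## Sanity check: no invoice number is shared between any two data frames
stopifnot(length(intersect(df2$invoice, df3$invoice)) == 0,
          length(intersect(df2$invoice, df4$invoice)) == 0,
          length(intersect(df3$invoice, df4$invoice)) == 0)
```

Note that the probabilities here apply to invoice numbers rather than rows, so the share of rows in each data frame will only roughly follow 40/50/10, and with few invoices a group can end up empty.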
Upvotes: 1