Reputation: 4227

split dataset into multiple datasets with random columns in r

I have a big dataset. I want to divide it into "n" sub-datasets, each of equal size "s". The last dataset may be smaller than the others if the total is not evenly divisible by the size. I then want to write each sub-dataset as a CSV file to the working directory.

Let's say I have the following small example:

set.seed(1234)
mydf <- data.frame (matrix(sample(1:10, 130, replace = TRUE), ncol = 13))
mydf

   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
1   3  7  1  9  6  4  7  5  8   2   2   2   8
2   5  3  4  6  9  5  3 10  5   8  10   2  10
3   4  6 10  4  4  6  3  4  2   9   9   2   9
4  10 10  9  4  3  7  7  7 10   6   7  10   2
5  10  3  9  3  2 10  9  6  4   4   4   6   3
6   7  2  8  7  5  5 10 10  9   3   7   8   4
7   3  2  2  7 10  9  2  2 10   1   1  10   4
8   3  9  9  7  3  1  7  6 10   3  10   3   2
9   9  3  6  9  3  2  2  3  4   2   9  10  10
10  6  4  3  3  5  9  3  9 10   7   4   6  10

I want to create a function that randomly splits the columns of the dataset into n subsets (in this case of size 3; as there are 13 columns, the first 4 datasets will have 3 columns each and the last will have 1) and writes each subset to a separate text file.

Here is what I did:

set.seed(123)
reshuffled <- sample(1:length(mydf),length(mydf), replace = FALSE)
# just crazy manual divide 
group1 <- reshuffled[1:3]; group2 <- reshuffled[4:6]; group3 <- reshuffled[7:9]
group4 <- reshuffled[10:12]; group5 <-  reshuffled[13]

# just manual 
data1 <- mydf[,group1]; data2 <- mydf[,group2]; ....so on;
# I want to write the dimensions of the dataset in the first row of each dataset
cat (dim(data1))
write.csv(data1, "data1.csv");  write.csv(data2, "data2.csv"); .....so on

Is it possible to loop this process? I have to generate 100 sub-datasets.

Upvotes: 2

Answers (3)

PeterD

Reputation: 439

To partition 'mydf' into n nearly equal parts, I took inspiration from this question and the corresponding answer: link.

It creates partitions whose sizes differ as little as possible between the smallest and the largest partition. In this example this difference is equal to 1. Example:

Partition method 1 - using the 'floor'-function (no reproducible code shown here). Divide 100 rows into 7 nearly equal parts/summands by successively sampling floor(100/7) = 14 indices in each of the first 6 iterations. The 7th part is the remainder. This yields:

14, 14, 14, 14, 14, 14, 16. Sum = 100, max difference = 2

Partition method 2 - using the 'ceiling'-function (no reproducible code shown here). Using the 'ceiling'-function instead of the 'floor'-function gives similar results:

15, 15, 15, 15, 15, 15, 10. Sum = 100, max difference = 5

Partition method 3 - using the formula from reference above. When using the procedure below, the vector ('sequence_diff') of partition sizes is:

14, 14, 14, 15, 14, 14, 15. Sum = 100, max difference = 1
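
The three size vectors above can be reproduced with a few lines. This is only a minimal sketch; the names 'n_rows', 'k', 'sizes_floor', 'sizes_ceiling' and 'sizes_balanced' are made up for illustration and are not used in the code further down.

```r
n_rows <- 100  # total number of rows to distribute
k      <- 7    # number of partitions

# Method 1: k - 1 parts of floor(n_rows / k), the last part takes the remainder
sizes_floor <- c(rep(floor(n_rows / k), k - 1),
                 n_rows - (k - 1) * floor(n_rows / k))

# Method 2: k - 1 parts of ceiling(n_rows / k), the last part takes the remainder
sizes_ceiling <- c(rep(ceiling(n_rows / k), k - 1),
                   n_rows - (k - 1) * ceiling(n_rows / k))

# Method 3: differences between floored cumulative boundaries
sizes_balanced <- diff(c(0, floor(n_rows * seq_len(k) / k)))

# Spread between the largest and the smallest part for each method
sapply(list(floor = sizes_floor, ceiling = sizes_ceiling, balanced = sizes_balanced),
       function(s) max(s) - min(s))
```

Only the third method keeps the spread at 1 when the number of rows is not a multiple of the number of partitions.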

R-code:

set.seed(1234)
#I increased the number of rows in the data frame to 100
mydf <- data.frame (matrix(sample(x = 1:100, size = 1300, replace = TRUE), 
                    ncol = 13))

index_list      <- list()       #Will store the indices for all partitions
indices         <- 1:nrow(mydf) #Initially contains all indices for the dataset 'mydf'
numb_partitions <- 7            #Specifies the number of partitions

sequence <- floor(((nrow(mydf)*1:numb_partitions)/numb_partitions))
sequence <- c(0, sequence)

#'sequence_diff' will contain the number of instances for each partition.
sequence_diff <- vector()
for(j in 1:numb_partitions){
    sequence_diff[j] <- sequence[j+1] - sequence[j]   
}  

#Inspect 'sequence_diff' and verify its elements sum up to the total 
#number of rows in 'mydf' (100).
sequence_diff
#[1] 14 14 14 15 14 14 15
sum(sequence_diff)
#[1] 100 #Correct!

for(i in 1:numb_partitions){

  #Use a different seed for each sampling iteration.
  set.seed(seed = i)

  #Sample 'sequence_diff[i]' indices from the remaining 'indices'.
  indices_partition <- sample(x = indices, 
                              size = sequence_diff[i], 
                              replace = FALSE)

  #Remove the selected indices from 'indices' so these indices will not be 
  #selected in successive iterations.
  indices           <- setdiff(x = indices, y = indices_partition)

  #Store the indices for the i-th iteration in the list 'index_list'. This 
  #is just to verify later that 
  #the procedure has divided all indices in 'numb_partitions' disjoint sets.
  index_list[[i]]   <- indices_partition

  #Dynamically create a new object that is named 'mydfx' in which x is the 
  #i-th partition. 
  assign(x = paste0("mydf", i), value = mydf[indices_partition,])

  write.csv(x = get(x = paste0("mydf", i)),  #Dynamically get the object from environment.
            file = paste0("mydf", i,".csv"), #Dynamically assign a name to the csv-file.
            row.names = FALSE)               #write.csv always uses sep = "," and writes column names.
}

#Check whether all index subsets are mutually exclusive: union should have 100 
#unique elements. 
length(unique(unlist(index_list)))
#[1] 100 #Correct!
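
The same row partition can also be built without the loop and without 'assign'/'get'. The following is a minimal sketch of that idea, not a replacement for the code above: it shuffles the row indices once and cuts them into consecutive chunks. The names 'sizes', 'shuffled', 'labels' and 'parts' are illustrative, and because only one seed is used, the resulting partitions differ from those produced by the loop.

```r
set.seed(1234)
mydf <- data.frame(matrix(sample(x = 1:100, size = 1300, replace = TRUE),
                          ncol = 13))
numb_partitions <- 7

# Partition sizes from the floored cumulative boundaries (method 3)
sizes <- diff(c(0, floor(nrow(mydf) * seq_len(numb_partitions) / numb_partitions)))

# Shuffle all row indices once, then label consecutive chunks 1..numb_partitions
shuffled   <- sample(nrow(mydf))
labels     <- rep(seq_len(numb_partitions), times = sizes)
index_list <- split(shuffled, labels)

# One data frame per partition, kept in a list instead of separate objects
parts <- lapply(index_list, function(idx) mydf[idx, , drop = FALSE])

# Every row index appears in exactly one partition
length(unique(unlist(index_list))) == nrow(mydf)
```

Keeping the partitions in a list makes it easy to write them out afterwards, for example by looping over 'seq_along(parts)' and calling 'write.csv' on each element.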

Upvotes: 0

Matthew Plourde

Reputation: 44614

Just for fun; this is probably slower than juba's answer:

mydf <- data.frame (matrix(sample(1:10, 130, replace = TRUE), ncol = 13))
size <- 3
by(t(mydf), 
   INDICES=sample(as.numeric(gl((ncol(mydf) %/% size) + 1, size, ncol(mydf))), 
                  ncol(mydf), 
                  replace=FALSE), 
   FUN=function(x) write.csv(t(x), paste(rownames(x), collapse='-'), row.names=F))

Upvotes: 1

juba

Reputation: 49033

Maybe there is a cleaner and simpler solution, but you can try the following:

mydf <- data.frame (matrix(sample(1:10, 130, replace = TRUE), ncol = 13))

## Number of columns for each sub-dataset
size <- 3

nb.cols <- ncol(mydf)
nb.groups <- nb.cols %/% size
reshuffled <- sample.int(nb.cols, replace=FALSE)
groups <- c(rep(1:nb.groups, each=size), rep(nb.groups+1, nb.cols %% size))
dfs <- lapply(split(reshuffled, groups), function(v) mydf[,v,drop=FALSE])

for (i in 1:length(dfs)) write.csv(dfs[[i]], file=paste("data",i,".csv",sep=""))
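
The question also asks for the dimensions of each sub-dataset in the first row of its file, which the loop above does not do. A minimal sketch of one way to add that, assuming a hypothetical helper named 'split_columns_to_csv' (the function name and its 'prefix' and 'dir' arguments are made up for illustration): it writes the dimensions through an open file connection and then lets 'write.csv' continue writing to the same connection.

```r
split_columns_to_csv <- function(df, size, prefix = "data", dir = ".") {
  # Shuffle the column indices and cut them into groups of at most 'size'
  shuffled <- sample.int(ncol(df))
  groups   <- ceiling(seq_along(shuffled) / size)
  col_sets <- split(shuffled, groups)

  files <- character(length(col_sets))
  for (i in seq_along(col_sets)) {
    sub      <- df[, col_sets[[i]], drop = FALSE]
    files[i] <- file.path(dir, paste0(prefix, i, ".csv"))

    con <- file(files[i], open = "w")
    # First line: number of rows and columns, separated by a space
    writeLines(paste(dim(sub), collapse = " "), con)
    # The connection is already open, so the CSV follows the first line
    write.csv(sub, con, row.names = FALSE)
    close(con)
  }
  invisible(files)
}

set.seed(123)
mydf  <- data.frame(matrix(sample(1:10, 130, replace = TRUE), ncol = 13))
files <- split_columns_to_csv(mydf, size = 3, dir = tempdir())
readLines(files[1], n = 2)
```

Because of the extra first line, such a file has to be read back with 'read.csv(file, skip = 1)'. Generating 100 sub-datasets then only depends on the number of columns and the chosen 'size'.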

Upvotes: 1
