user1489597

Reputation: 303

Split a list of data frames in a for loop (error on dimensions)

I have a very large data set that I have already split into 50 pieces, so the data look like: file1, file2, file3, ..., file50 (all data frames).

file_total <- c(file1,...,file50)  

I know this will combine them into a list, but I can't use rbind because the full data set is huge, and the plyr library takes forever to run.

Within each of the files, I have to split the rows by one factor, call it "id", and then write each id subset to its own .csv file.

So far, my code is:

d_split <- split(file1, file1[1])

library(plyr)
id <- unlist(lapply(d_split,"[",1,1)) # this returns the unique id

for (j in seq_along(id))
{ 
    write.csv(d_split[[j]], file=paste(id[j], "csv", sep="."))
}

This works!

But it doesn't work when I put it inside another for loop:

for (i in file_total)
{
    d_split <- split(i, i[1])
    id <- unlist(lapply(d_split,"[",1,1)) 
    for (j in seq_along(id))
    {
        write.csv(d_split[[j]], file=paste(id[j], "csv", sep="."))
    }
}

It returns the following error message:

Error in FUN(X[[1L]], ...) : incorrect number of dimensions

I mean, I could do it manually by copying and pasting all 50 files into the code, but I was wondering if anyone could fix my code so that a single run handles everything.

Upvotes: 1

Views: 894

Answers (1)

David Robinson

Reputation: 78590

The problem comes from how you combine the data frames. Instead of combining them with c, put them in a list:

file_total <- list(file1,...,file50) 

At this point, for (i in file_total) will iterate over the data frames as you want it to.
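As a rough sketch of how the whole loop might look once the data frames are in a list: the two small data frames below are made-up stand-ins for file1 ... file50, the id is assumed to be in the first column (as in the question), and out_dir is a hypothetical output location. It uses the names of the split result as the ids instead of the lapply call, which is a simplification on my part rather than something the question requires.

```r
# Toy stand-ins for file1 ... file50 (hypothetical data; first column is the id)
file1 <- data.frame(id = c("a", "a", "b"), value = 1:3)
file2 <- data.frame(id = c("c", "d", "d"), value = 4:6)

# list() keeps each data frame intact as one element
file_total <- list(file1, file2)

# Assumption: write to a disposable directory; swap in your own path
out_dir <- tempdir()

for (df in file_total) {
  # Split on the first column; the names of the result are the id values
  d_split <- split(df, df[[1]])
  for (id in names(d_split)) {
    # One CSV per id, named <id>.csv as in the question
    write.csv(d_split[[id]],
              file = file.path(out_dir, paste(id, "csv", sep = ".")))
  }
}
```

One thing to watch for: if the same id shows up in more than one of the 50 pieces, each later write.csv call will overwrite the earlier file for that id, so you may need to handle that case separately.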

As an explanation: using c with data frames (as I'm assuming file1 and file2 are) concatenates their columns into one flat list of vectors rather than producing a list of data frames. For instance:

file1 = data.frame(x=1:20)
file2 = data.frame(y=20:40)
file_total = c(file1, file2)
# file_total will be:
# $x
#  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
#
# $y
#  [1] 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

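Thus, iterating over them will actually iterate over the individual columns as vectors. The two-index subset in your lapply call ("[" with 1, 1) then gets applied to a plain vector, which is what raises "incorrect number of dimensions". However, using list to combine them will let you iterate over the data frames themselves: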

> list(file1, file2)
[[1]]
    x
1   1
2   2
3   3
4   4
5   5
6   6
7   7
8   8
9   9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20

[[2]]
    y
1  20
2  21
3  22
4  23
5  24
6  25
7  26
8  27
9  28
10 29
11 30
12 31
13 32
14 33
15 34
16 35
17 36
18 37
19 38
20 39
21 40
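If you want to check which situation you are in, a quick diagnostic along these lines may help. It reuses the file1 and file2 from the example above; the names combined_c, combined_list and err are just illustrative.

```r
file1 <- data.frame(x = 1:20)
file2 <- data.frame(y = 20:40)

combined_c    <- c(file1, file2)     # flattens to a list of column vectors
combined_list <- list(file1, file2)  # keeps the data frames whole

# Are the elements still data frames?
print(sapply(combined_c, is.data.frame))
print(sapply(combined_list, is.data.frame))

# Two-index subsetting on a bare vector fails; capture the message
err <- tryCatch(combined_c[[1]][1, 1],
                error = function(e) conditionMessage(e))
print(err)
```

The captured message should match the one in the question, since a vector has no second dimension to index into.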

Upvotes: 3
