A.J

Reputation: 1180

Counting columns in multiple files in R

I have a large dataset (~250,000 records) and I used splitting to make the data more approachable. I ended up with 250 splits.

I want to know which split has the most columns. I know I should be using list.files, but I am not sure how to make it work.

I created the following reproducible example:

df1 <- data.frame(A = "a", B = "b", C = "c")
df2 <- data.frame(A = "a", B = "b")
df3 <- data.frame(A = "a")

write.csv(df1, file = "df1.csv", row.names=FALSE)
write.csv(df2, file = "df2.csv", row.names=FALSE)
write.csv(df3, file = "df3.csv", row.names=FALSE)

filenames <- list.files(pattern = "\\.csv$", full.names = TRUE)

Looking at the example above, I'd like to find out that df1 has the most attributes compared with the other files.

Can a for loop and a simple ncol call make this work?

Upvotes: 0

Views: 1373

Answers (3)

Colonel Beauvel

Reputation: 31161

read.csv is very slow compared to readLines if you only want to know the number of columns:

sapply(filenames, function(x) {y <- readLines(x, n = 1); nchar(y) - nchar(gsub(",", "", y)) + 1})
#./df1.csv ./df2.csv ./df3.csv 
#        3         2         1 
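One caveat worth noting, as a sketch under the assumption that a field might contain a quoted comma: counting raw commas in the header over-counts in that case, whereas base R's count.fields() is quote-aware and still avoids parsing the whole file into a data frame.

```r
# Hypothetical example: a quoted comma inside a field inflates a raw comma count.
df <- data.frame(A = "a,b", B = "c")
write.csv(df, "quoted.csv", row.names = FALSE)  # header line: "A","B"

# count.fields() respects quoting, so it reports 2 columns, not 3
count.fields("quoted.csv", sep = ",")[1]  # 2
```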

Upvotes: 3

rhozzy

Reputation: 352

Piggy-backing off of @Andriy T.'s response, here is code that will produce a list of the column counts of each of your files:

lapply(list.files(),function(x){ncol(read.csv(x))})

From there, you can select the maximum, get the index, and figure out which file it corresponds to.
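A minimal sketch of that last step, recreating the question's three example files and using sapply (rather than lapply) so which.max() can be applied directly to the resulting named vector:

```r
# Recreate the question's example files (assumes the working directory is writable).
write.csv(data.frame(A = "a", B = "b", C = "c"), "df1.csv", row.names = FALSE)
write.csv(data.frame(A = "a", B = "b"), "df2.csv", row.names = FALSE)
write.csv(data.frame(A = "a"), "df3.csv", row.names = FALSE)

# sapply returns a named vector, so the file name rides along for free
ncols <- sapply(list.files(pattern = "^df[123]\\.csv$"), function(x) ncol(read.csv(x)))
names(which.max(ncols))  # "df1.csv"
```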

Upvotes: 2

Andriy T.

Reputation: 2030

If you don't want to import each file into R, you can use the file.info() function to obtain the size of each file:

sapply(list.files(), file.info)

Alternatively, you can use read.csv(..., nrows = 10), for example, to see the structure without having to load the entire table:

sapply(list.files(), function(f) ncol(read.csv(f, nrows = 10)))

Upvotes: 2
