Reputation: 1180
I have a large dataset (~250,000 records), so I split it into 250 smaller files to make the data more approachable. I want to know which split has the most columns. I know I should be using list.files, but I am not sure how to make it work.
I created the following reproducible example:
df1 <- data.frame(A = "a", B = "b", C = "c")
df2 <- data.frame(A = "a", B = "b")
df3 <- data.frame(A = "a")
write.csv(df1, file = "df1.csv", row.names=FALSE)
write.csv(df2, file = "df2.csv", row.names=FALSE)
write.csv(df3, file = "df3.csv", row.names=FALSE)
filenames <- list.files(pattern = "\\.csv$", full.names = TRUE)
Looking at the example above, I'd like to find out that df1 has the most attributes compared to the other files. Can a for loop and a simple ncol call make this work?
Upvotes: 0
Views: 1373
Reputation: 31161
read.csv is very slow compared to readLines if you only want to know the number of columns:
# read just the header line and count separators: columns = commas + 1
sapply(filenames, function(x) {y <- readLines(x, n = 1); nchar(y) - nchar(gsub(',', '', y)) + 1})
#./df1.csv ./df2.csv ./df3.csv
# 3 2 1
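Note that a bare comma count over-counts whenever a field contains a comma inside quotes. If your splits may contain quoted commas, here is a minimal sketch using base R's count.fields, which honors quoting while still reading only the header line:
sapply(filenames, function(x) count.fields(textConnection(readLines(x, n = 1)), sep = ","))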
Upvotes: 3
Reputation: 352
Piggy-backing off of @Andriy T.'s response, here is code that will produce a list of the column counts for each of your files:
lapply(list.files(pattern = "\\.csv$"), function(x) ncol(read.csv(x)))
From there, you can select the maximum, get the index, and figure out which file it corresponds to.
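A minimal sketch of that last step, using the files from the question (sapply returns a named vector, which makes the lookup straightforward):
counts <- sapply(list.files(pattern = "\\.csv$"), function(x) ncol(read.csv(x)))
names(counts)[which.max(counts)]
# [1] "df1.csv"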
Upvotes: 2
Reputation: 2030
If you don't want to import each file into R, you can use the file.info() function to obtain the size of each file:
# file.info() is vectorized, so it accepts the whole vector of file names
file.info(list.files())
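Since file.info() returns a data frame with one row per file, the sizes alone can be pulled out as a vector; file.size() (base R since 3.2.0) is a shortcut for the same thing:
file.info(list.files())$size
# or, equivalently
file.size(list.files())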
Alternatively, you can use read.csv(..., nrows = 10), for example, to see the structure without having to load the entire table:
# nrows = 10 reads only the first 10 data rows, which is enough to count columns
sapply(list.files(pattern = "\\.csv$"), function(f) ncol(read.csv(f, nrows = 10)))
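And to actually inspect the structure of one split without loading it entirely (df1.csv comes from the question's example):
# str() shows the column names and types read from just the first rows
str(read.csv("df1.csv", nrows = 10))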
Upvotes: 2