Reputation: 481
I have a CSV file of about 1 GB, and because my laptop has only a basic configuration, I can't open the file in Excel or R. Out of curiosity, I would still like to know the number of rows in the file. How can I do that, if it can be done at all?
Upvotes: 48
Views: 72849
Reputation: 1
Just open the file in Notepad++ and scroll to the end of the file. You will find the number of rows there.
Upvotes: -1
Reputation: 2253
Implementing Tony's answer in R:
# build the shell command, then capture its output and convert it to a number
file <- "/path/to/file"
cmd <- paste("wc -l <", file)
as.numeric(system(cmd, intern = TRUE))
This is about 4x faster than data.table for a file with 100k lines:
> microbenchmark::microbenchmark(
+ nrow(fread("~/Desktop/cmx_bool.csv", select = 1L)),
+ as.numeric(system("wc -l <~/Desktop/cmx_bool.csv", intern = TRUE))
+ )
Unit: milliseconds
                                                               expr       min        lq      mean   median        uq      max neval
                nrow(fread("~/Desktop/cmx_bool.csv", select = 1L)) 128.06701 131.12878 150.43999 135.1366 142.99937 629.4880   100
as.numeric(system("wc -l <~/Desktop/cmx_bool.csv", intern = TRUE))  27.70863  28.42997  34.83877  29.5070  33.32973 270.3104   100
Upvotes: 0
Reputation: 745
Estimate the number of lines based on the size of the first 1000 lines:
# total characters in the first 1000 lines
size1000 <- sum(nchar(readLines(con = "dgrp2.tgeno", n = 1000)))
# total size of the file, in bytes
sizetotal <- file.size("dgrp2.tgeno")
# scale up to an estimate of the total number of lines
1000 * sizetotal / size1000
This is usually good enough for most purposes - and is a lot faster for huge files.
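As a rough sketch (the helper name estimate_nrows and its sample_lines argument are my own, not part of this answer), the same estimate can be wrapped in a small function:
estimate_nrows <- function(path, sample_lines = 1000) {
  # characters in the first sample_lines lines (newlines are not counted,
  # so this tends to overestimate slightly)
  sample_size <- sum(nchar(readLines(path, n = sample_lines)))
  # scale the total file size by the average size of a sampled line
  round(sample_lines * file.size(path) / sample_size)
}
estimate_nrows("dgrp2.tgeno")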
Upvotes: 8
Reputation: 99331
Option 1:
Through a file connection, count.fields() counts the number of fields per line of the file based on some sep value (that we don't care about here). So if we take the length of that result, theoretically we should end up with the number of lines (and rows) in the file.
length(count.fields(filename))
If you have a header row, you can skip it with skip = 1
length(count.fields(filename, skip = 1))
There are other arguments that you can adjust for your specific needs, like skipping blank lines.
args(count.fields)
# function (file, sep = "", quote = "\"'", skip = 0, blank.lines.skip = TRUE,
# comment.char = "#")
# NULL
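For instance (my own illustration, not from the original answer), you can include blank lines in the count and switch off comment handling:
length(count.fields(filename, skip = 1, blank.lines.skip = FALSE, comment.char = ""))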
See help(count.fields) for more.
It's not too bad as far as speed goes. I tested it on one of my baseball files that contains 99846 rows.
nrow(data.table::fread("Batting.csv"))
# [1] 99846
system.time({ l <- length(count.fields("Batting.csv", skip = 1)) })
# user system elapsed
# 0.528 0.000 0.503
l
# [1] 99846
file.info("Batting.csv")$size
# [1] 6153740
(The more efficient) Option 2: Another idea is to use data.table::fread() to read the first column only, then take the number of rows. This would be very fast.
system.time(nrow(fread("Batting.csv", select = 1L)))
# user system elapsed
# 0.063 0.000 0.063
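One caveat worth noting (my addition, not from the answer): fread() detects the header row by default, so nrow() here returns the number of data rows. To count every line of the file, header included, read it as headerless:
nrow(data.table::fread("Batting.csv", select = 1L, header = FALSE))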
Upvotes: 29
Reputation: 337
Here is something I used:
# read the file in chunks of 20000 lines and keep a running count
testcon <- file("xyzfile.csv", open = "r")
readsizeof <- 20000
nooflines <- 0
while ((linesread <- length(readLines(testcon, readsizeof))) > 0) {
  nooflines <- nooflines + linesread
}
close(testcon)
nooflines
Check out this post for more: https://www.r-bloggers.com/easy-way-of-determining-number-of-linesrecords-in-a-given-large-file-using-r/
Upvotes: 2
Reputation: 1408
For Linux/Unix:
wc -l filename
For Windows:
find /c /v "A String that is extremely unlikely to occur" filename
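If you would rather call these commands from R, here is a sketch of my own (not part of this answer; the function name count_lines_cmd and the parsing of the Windows output are assumptions):
count_lines_cmd <- function(path) {
  if (.Platform$OS.type == "windows") {
    # `find /c` ends its output with a line like "---------- FILENAME: 12345";
    # keep the last output line and pull out the trailing number
    out <- shell(sprintf('find /c /v "A String that is extremely unlikely to occur" "%s"', path),
                 intern = TRUE)
    as.numeric(sub(".*:\\s*", "", out[length(out)]))
  } else {
    as.numeric(system(sprintf("wc -l < '%s'", path), intern = TRUE))
  }
}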
Upvotes: 72