Ohhm Prakash

Reputation: 481

Is it possible to get the number of rows in a CSV file without opening it?

I have a CSV file of size ~1 GB, and since my laptop has only a basic configuration, I'm not able to open the file in Excel or R. But out of curiosity, I would like to get the number of rows in the file. How can I do it, if it is possible at all?

Upvotes: 48

Views: 72849

Answers (6)

Aamir Abbas

Reputation: 1

Just open the file in Notepad++ and scroll to the end of the file; the line number of the last line is the number of rows.

Upvotes: -1

Jeff Bezos

Reputation: 2253

Implementing Tony's answer in R:

file <- "/path/to/file"
cmd <- paste("wc -l <", file)
as.numeric(system(cmd, intern = TRUE))

This is about 4x faster than data.table::fread() for a file with 100k lines:

>     microbenchmark::microbenchmark(
+         nrow(fread("~/Desktop/cmx_bool.csv", select = 1L)),
+         as.numeric(system("wc -l <~/Desktop/cmx_bool.csv", intern = TRUE))
+     )
Unit: milliseconds
                                                                expr       min        lq      mean   median        uq      max neval
                 nrow(fread("~/Desktop/cmx_bool.csv", select = 1L)) 128.06701 131.12878 150.43999 135.1366 142.99937 629.4880   100
 as.numeric(system("wc -l <~/Desktop/cmx_bool.csv", intern = TRUE))  27.70863  28.42997  34.83877  29.5070  33.32973 270.3104   100
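If wc is not available (e.g. on Windows without Unix tools installed), the same count can be done in plain R by counting newline bytes. This is a minimal sketch (count_lines is my name, not an existing function), assuming the file ends with a trailing newline:

count_lines <- function(path, chunk_size = 16 * 1024^2) {
  # read the file in 16 MB raw chunks and count newline (0x0a) bytes
  con <- file(path, open = "rb")
  on.exit(close(con))
  n <- 0
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0) break
    n <- n + sum(chunk == as.raw(10L))
  }
  n
}

count_lines("~/Desktop/cmx_bool.csv")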

Upvotes: 0

pallevillesen

Reputation: 745

Estimate the number of lines based on the size of the first 1000 lines:

# bytes in the first 1000 lines; readLines() strips the newline
# characters, so add 1 byte per line to account for them
size1000  <- sum(nchar(readLines(con = "dgrp2.tgeno", n = 1000), type = "bytes") + 1)

sizetotal <- file.size("dgrp2.tgeno")
1000 * sizetotal / size1000

This is usually accurate enough for most purposes, and it is far faster than reading the whole of a huge file.
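The same idea as a small reusable helper (the function name and default sample size are mine, not part of the original answer):

estimate_lines <- function(path, sample_lines = 1000) {
  # average byte size of the first sample_lines lines, +1 per line
  # for the newline character that readLines() strips
  sample_bytes <- sum(nchar(readLines(path, n = sample_lines), type = "bytes") + 1)
  round(sample_lines * file.size(path) / sample_bytes)
}

estimate_lines("dgrp2.tgeno")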

Upvotes: 8

Rich Scriven

Reputation: 99331

Option 1:

Through a file connection, count.fields() counts the number of fields per line of the file based on some sep value (that we don't care about here). So if we take the length of that result, theoretically we should end up with the number of lines (and rows) in the file.

length(count.fields(filename))

If you have a header row, you can skip it with skip = 1

length(count.fields(filename, skip = 1))

There are other arguments that you can adjust for your specific needs, like skipping blank lines.

args(count.fields)
# function (file, sep = "", quote = "\"'", skip = 0, blank.lines.skip = TRUE, 
#     comment.char = "#") 
# NULL

See help(count.fields) for more.
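For instance, blank lines are skipped by default; if you want them counted as physical lines too, turn that off:

length(count.fields(filename, skip = 1, blank.lines.skip = FALSE))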

It's not too bad as far as speed goes. I tested it on one of my baseball files that contains 99846 rows.

nrow(data.table::fread("Batting.csv"))
# [1] 99846

system.time({ l <- length(count.fields("Batting.csv", skip = 1)) })
#   user  system elapsed 
#  0.528   0.000   0.503 

l
# [1] 99846
file.info("Batting.csv")$size
# [1] 6153740

Option 2 (the more efficient option):

Another idea is to use data.table::fread() to read only the first column, then take the number of rows. This is very fast.

system.time(nrow(fread("Batting.csv", select = 1L)))
#   user  system elapsed 
#  0.063   0.000   0.063 
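One caveat: fread() treats the first line as the header, so nrow() counts data rows, while a raw line count such as wc -l also counts the header line. A quick sanity check on the same file (Unix-alikes only, and assuming the file ends with a trailing newline):

library(data.table)

nrow(fread("Batting.csv", select = 1L))                   # data rows: 99846
as.integer(system("wc -l < Batting.csv", intern = TRUE))  # physical lines: 99847, header included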

Upvotes: 29

Narahari B M

Reputation: 337

Here is something I used:

testcon <- file("xyzfile.csv", open = "r")
readsizeof <- 20000   # number of lines to read per chunk
nooflines <- 0
while ((linesread <- length(readLines(testcon, readsizeof))) > 0) {
  nooflines <- nooflines + linesread
}
close(testcon)
nooflines

Check out this post for more: https://www.r-bloggers.com/easy-way-of-determining-number-of-linesrecords-in-a-given-large-file-using-r/

Upvotes: 2

Tony Ruth

Reputation: 1408

For Linux/Unix:

wc -l filename

For Windows:

find /c /v "A String that is extremely unlikely to occur" filename
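Both commands count every physical line, including a header row if there is one, so subtract 1 for the number of data rows. Note also that wc -l counts newline characters, so a file whose last line lacks a trailing newline reports one line fewer than you might expect.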

Upvotes: 72
