Reputation: 6555
I'm using Python (Django framework) to read a CSV file. I pull just 2 lines out of this CSV, as you can see. What I have been trying to do is also store the total number of rows in the CSV in a variable.
How can I get the total number of rows?
file = object.myfilePath
fileObject = csv.reader(file)
for i in range(2):
    data.append(fileObject.next())
I have tried:
len(fileObject)
fileObject.length
Upvotes: 172
Views: 391913
Reputation: 390
filename = 'data.csv'
# Count the number of lines in the file
with open(filename) as f:
    row_count = sum(1 for line in f)
print("Number of records:", row_count)
The result was: Number of records: 163210690
Upvotes: 0
Reputation: 392
With the pyarrow library, counting is about 4 times faster than the method dixhom suggested (3.57 s vs. 854 ms in the timings below).
👉 Tested on a CSV with 3,921,865 rows and a file size of 927 MB
Standard
sum(1 for _ in open(file_path))
# result: 3.57 s ± 90.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With pyarrow
import pyarrow.csv as csv
sum([len(chunk) for chunk in csv.open_csv(file_path)])
# result: 854 ms ± 4.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 0
Reputation: 1
I think mine will be the simplest approach here:
import csv
# Read inside the with block so the file is still open while the rows are consumed
with open(filename, 'r', newline='') as file:
    csvfile = csv.reader(file)
    print("row", len(list(csvfile)))
Upvotes: 0
Reputation: 637
If you have to parse the CSV (e.g., because fields contain line breaks or there are commented-out lines) but the CSV is too large to fit in memory all at once, you can parse it piece by piece:
import csv
import sys
import pandas as pd
csv.field_size_limit(sys.maxsize)  # raise the csv module's maximum field size
cnt = 0
# Read the file in chunks of one million rows and add up the chunk lengths
for chunk in pd.read_csv(filepath, chunksize=10**6):
    cnt += len(chunk)
print(cnt)
Upvotes: 0
Reputation: 5879
If you are working on a Unix system, the fastest method is the following shell command
cat FILE_NAME.CSV | wc -l
From a Jupyter notebook or IPython, you can run it with a leading !:
! cat FILE_NAME.CSV | wc -l
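Where a shell is not available, a rough Python counterpart of `wc -l` is to read the file in binary chunks and count newline bytes. This is a sketch of a different technique, not the shell command itself; the function name `count_newlines`, the chunk size, and the temporary demo file are all assumptions made for illustration.

```python
import os
import tempfile


def count_newlines(path, chunk_size=1024 * 1024):
    """Count newline bytes in a file by reading fixed-size binary chunks."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            total += chunk.count(b"\n")
    return total


# Hypothetical demo file with three newline-terminated lines.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("a,b\n1,2\n3,4\n")
    demo_path = tmp.name

newline_total = count_newlines(demo_path)
os.remove(demo_path)
print(newline_total)  # 3
```

Like `wc -l`, this counts newline characters, so a file whose last line has no trailing newline reports one fewer than the number of lines.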
Upvotes: -1
Reputation: 121
After iterating over the whole file with csv.reader(), you have the total number of lines read, via the instance variable line_num:
import csv
with open('csv_path_file') as f:
    csv_reader = csv.reader(f)
    for row in csv_reader:
        pass
    print(csv_reader.line_num)
Quoting the official documentation:
csvreader.line_num
The number of lines read from the source iterator.
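Small caveat: line_num counts lines, not records, so it can be larger than the number of rows when a record spans multiple lines.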
Upvotes: 12
Reputation: 1157
You might want to try something as simple as this on the command line:
sed -n '$=' filename
or
wc -l filename
Upvotes: 2
Reputation: 63
You can also use a classic for loop:
import pandas as pd
df = pd.read_csv('your_file.csv')
count = 0
for i in df['a_column']:
    count = count + 1
print(count)
Upvotes: 3
Reputation: 595
import pandas as pd
data = pd.read_csv('data.csv')
totalInstances=len(data)
Upvotes: 1
Reputation: 59
I think we can improve the best answer a little bit. I'm using:
row_count = sum(1 for _ in reader)
Moreover, we shouldn't forget that Pythonic code does not always give the best performance in a project. For example, if several operations can be done on the same data set in one pass, it is better to do them all in a single loop than to write two or more separate Pythonic loops.
Upvotes: 4
Reputation: 15
Try:
import pandas as pd
data = pd.read_csv("data.csv")
data.shape
In the output you will see something like (aa, bb), where aa is the number of rows.
Upvotes: -2
Reputation: 309
To do it you need to have a bit of code like my example here:
# The with block closes the file once the lines have been read
with open("Task1.csv") as file:
    numline = len(file.readlines())
print(numline)
I hope this helps everyone.
Upvotes: 20
Reputation: 3035
Thank you for the comments.
I timed several ways of getting the number of lines in a CSV file. The fastest method is below.
with open(filename) as f:
    sum(1 for line in f)
Here is the code tested.
import timeit
import csv
import pandas as pd
filename = './sample_submission.csv'
# Time funcname over 100 runs, then print the mean time and the count it returns
def talktime(filename, funcname, func):
    print(f"# {funcname}")
    t = timeit.timeit(f'{funcname}("{filename}")', setup=f'from __main__ import {funcname}', number=100) / 100
    print('Elapsed time : ', t)
    print('n = ', func(filename))
    print('\n')
def sum1forline(filename):
    with open(filename) as f:
        return sum(1 for line in f)
talktime(filename, 'sum1forline', sum1forline)
def lenopenreadlines(filename):
    with open(filename) as f:
        return len(f.readlines())
talktime(filename, 'lenopenreadlines', lenopenreadlines)
def lenpd(filename):
    # + 1 accounts for the header row, which pandas does not count as data
    return len(pd.read_csv(filename)) + 1
talktime(filename, 'lenpd', lenpd)
def csvreaderfor(filename):
    cnt = 0
    with open(filename) as f:
        cr = csv.reader(f)
        for row in cr:
            cnt += 1
    return cnt
talktime(filename, 'csvreaderfor', csvreaderfor)
def openenum(filename):
    cnt = 0
    with open(filename) as f:
        for i, line in enumerate(f, 1):
            cnt += 1
    return cnt
talktime(filename, 'openenum', openenum)
The result was below.
# sum1forline
Elapsed time : 0.6327946722068599
n = 2528244
# lenopenreadlines
Elapsed time : 0.655304473598555
n = 2528244
# lenpd
Elapsed time : 0.7561274056295324
n = 2528244
# csvreaderfor
Elapsed time : 1.5571560935772661
n = 2528244
# openenum
Elapsed time : 0.773000013928679
n = 2528244
In conclusion, sum(1 for line in f) is fastest, but the difference from len(f.readlines()) might not be significant.
sample_submission.csv is 30.2 MB and has 31 million characters.
Upvotes: 98
Reputation: 11096
This works for csv and all files containing strings in Unix-based OSes:
import os
numOfLines = int(os.popen('wc -l < file.csv').read()[:-1])
If the CSV file contains a header row, you can subtract one from numOfLines above:
numOfLines = numOfLines - 1
Upvotes: 4
Reputation: 2930
row_count = sum(1 for line in open(filename))
worked for me.
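Note: sum(1 for line in csv.reader(filename)) does not give the row count. Passing the filename string to csv.reader makes it iterate over the characters of that string rather than over the lines of the file.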
Upvotes: 6
Reputation: 871
First, open the file with open:
input_file = open("nameOfFile.csv","r+")
Then pass it to csv.reader to parse the CSV:
reader_file = csv.reader(input_file)
Finally, get the number of rows with len:
value = len(list(reader_file))
The complete code is:
import csv
input_file = open("nameOfFile.csv","r+")
reader_file = csv.reader(input_file)
value = len(list(reader_file))
Remember that if you want to reuse the CSV file, you have to call input_file.seek(0) first, because building a list from reader_file reads the whole file and leaves the file pointer at the end.
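The rewind step can be sketched as follows. This is only an illustration: an in-memory `io.StringIO` object stands in for the opened file, and the names `leftover` and `first_row` are hypothetical.

```python
import csv
import io

# Hypothetical in-memory data standing in for open("nameOfFile.csv").
input_file = io.StringIO("h1,h2\n1,2\n3,4\n")

reader_file = csv.reader(input_file)
value = len(list(reader_file))  # reads every row; the file pointer is now at the end

# Without rewinding, nothing is left to read.
leftover = list(reader_file)

# Rewind the underlying file and build a fresh reader to start over.
input_file.seek(0)
reader_file = csv.reader(input_file)
first_row = next(reader_file)

print(value, leftover, first_row)  # 3 [] ['h1', 'h2']
```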
Upvotes: 10
Reputation: 2529
Use list() to get a more workable object. The reader can only be consumed once, so store the result and reuse it.
You can then count, skip, and mutate to your heart's desire:
rows = list(fileObject)  # list of rows
len(rows)  # number of rows in the file
rows[10:]  # skip the first 10 rows
Upvotes: 2
Reputation: 1121416
You need to count the number of rows:
row_count = sum(1 for row in fileObject) # fileObject is your csv.reader
Using sum() with a generator expression makes for an efficient counter, avoiding storing the whole file in memory.
If you already read 2 rows to start with, then you need to add those 2 rows to your total; rows that have already been read are not being counted.
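A minimal sketch of that adjustment, assuming a small in-memory CSV in place of the real file (the sample data and the names `first_two` and `total_rows` are illustrative, not from the question):

```python
import csv
import io

# Hypothetical sample data standing in for the real CSV file.
sample = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
reader = csv.reader(sample)

# Pull the first two rows, as the question does.
first_two = [next(reader) for _ in range(2)]

# The reader is now past those rows, so count what is left and add them back.
total_rows = len(first_two) + sum(1 for _ in reader)

print(total_rows)  # 4
```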
Upvotes: 253
Reputation: 173
Several of the above suggestions count the number of LINES in the csv file. But some CSV files will contain quoted strings which themselves contain newline characters. MS CSV files usually delimit records with \r\n, but use \n alone within quoted strings.
For a file like this, counting lines of text (as delimited by newline) in the file will give too large a result. So for an accurate count you need to use csv.reader to read the records.
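The difference can be seen on a small example. This is a sketch with made-up data: a header plus two records, one of which has a quoted field containing a newline.

```python
import csv
import io

# Hypothetical CSV text: records end in \r\n, the quoted field contains a bare \n.
text = 'id,comment\r\n1,"line one\nline two"\r\n2,plain\r\n'

# Counting lines of text treats the embedded newline as a record boundary.
line_count = len(text.splitlines())

# csv.reader keeps the quoted field together, so it yields one row per record.
record_count = sum(1 for _ in csv.reader(io.StringIO(text, newline="")))

print(line_count, record_count)  # 4 3
```

When reading from a real file, opening it with `newline=""` lets the csv module handle the line endings itself, as the csv documentation recommends.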
Upvotes: 15
Reputation: 3147
import csv
count = 0
# Python 3: open in text mode with newline='' so csv handles line endings itself
with open('filename.csv', newline='') as count_file:
    csv_reader = csv.reader(count_file)
    for row in csv_reader:
        count += 1
print(count)
Upvotes: 3