Reputation: 670
I have a very large CSV file (6.2 GB). I want to count how many lines it has using Python. What I currently have is the following:
import time
file_name = 'TickStory/EURUSD.csv'
start = time.time()
with open(file_name) as f:
    line_count = sum(1 for line in f)
print(line_count)
end = time.time()
print(end - start)
Every column in the file has a fixed number of characters. The content of the file is as follows:
Timestamp,Bid price
2012-01-01 22:00:36.416,1.29368
2012-01-01 22:00:40.548,1.29366
2012-01-01 22:01:48.884,1.29365
2012-01-01 22:01:53.775,1.29365
2012-01-01 22:01:54.594,1.29366
2012-01-01 22:01:55.390,1.29367
2012-01-01 22:02:40.765,1.29368
2012-01-01 22:02:41.027,1.29368
...
...
My current code takes around 49.99 seconds. Is there any way to make it faster?
Thanks in advance.
N.B.: There are a lot of existing solutions for counting lines cheaply in Python. However, my situation is different from the others because in my file all the lines have a fixed number of characters (except the header line). Is there any way to use that to my advantage?
Upvotes: 1
Views: 774
Reputation: 33159
Since each row has a fixed number of characters, just get the file's size in bytes with os.path.getsize, subtract the length of the header, then divide by the length of each row. Something like this:
import os
file_name = 'TickStory/EURUSD.csv'
len_head = len('Timestamp,Bid price\n')
len_row = len('2012-01-01 22:00:36.416,1.29368\n')
size = os.path.getsize(file_name)
print((size - len_head) // len_row + 1)  # integer division; + 1 counts the header line
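This assumes all characters in the file are 1 byte and that every line, including the last, ends with a single \n (files with \r\n line endings need one extra byte per line).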
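As a variation on the snippet above, here is a minimal sketch that measures the header and row lengths from the file itself instead of hard-coding them. The function name count_lines_fixed_width is made up for illustration, and the sketch assumes every data row has exactly the same byte length as the first one. Reading in binary mode makes the lengths byte counts, so \r\n endings are accounted for as well.

```python
import os

def count_lines_fixed_width(path):
    """Count lines in a file whose data rows all have the same byte length.

    Assumption: every row after the header is exactly as long as the
    first data row (including its line terminator).
    """
    size = os.path.getsize(path)
    with open(path, 'rb') as f:      # binary mode: lengths are in bytes
        header = f.readline()        # includes the line terminator
        first_row = f.readline()
    if not header:
        return 0                     # empty file
    if not first_row:
        return 1                     # header only
    body = size - len(header)
    # Round up so a final row without a trailing newline is still counted.
    rows = -(-body // len(first_row))
    return rows + 1                  # + 1 for the header line
```

Only two lines are read, so the running time should not depend on the file size.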
Upvotes: 3
Reputation: 32294
Try running wc on your Ubuntu machine:
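import subprocess
file_name = 'TickStory/EURUSD.csv'
# capture_output and text require Python 3.7+
result = subprocess.run(['wc', '-l', file_name], capture_output=True, text=True)
# wc prints "<count> <file name>"; keep only the count
print(int(result.stdout.split()[0]))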
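If wc is not available (for example on Windows), a rough pure-Python equivalent is to read the file in large binary chunks and count newline bytes. This is a different technique from calling wc, shown only as a sketch; the function name count_newlines and the 1 MiB chunk size are arbitrary choices, and no timings are claimed for it.

```python
def count_newlines(path, chunk_size=1024 * 1024):
    """Count b'\\n' bytes in a file by reading it in fixed-size chunks."""
    count = 0
    with open(path, 'rb') as f:      # binary mode avoids decoding overhead
        while True:
            chunk = f.read(chunk_size)
            if not chunk:            # empty bytes object means end of file
                break
            count += chunk.count(b'\n')
    return count
```

Like wc -l, this counts newline characters, so a final line without a trailing newline is not included in the result.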
Upvotes: 0
Reputation: 161
Frankly, the time might not change much, as you still have to read the whole file. You can try this, since you don't have to iterate through the file yourself and Python does it for you. Note that list(reader) holds every row in memory, which may be a problem for a 6.2 GB file:
import csv
with open('TickStory/EURUSD.csv', "r") as f:
    reader = csv.reader(f, delimiter=",")
    data = list(reader)
    row_count = len(data)
print(row_count)
In such a case I would suggest maintaining an additional file containing metadata for this file, with row_count and other details, and taking care to update the metadata whenever the file is updated.
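One way the metadata idea could look, as a sketch only: store the row count in a JSON file next to the data file, together with the data file's size and modification time, and recount only when those no longer match. The function name cached_line_count, the .meta.json suffix, and the JSON layout are all hypothetical; count_fn stands for whatever counting routine you already use.

```python
import json
import os

def cached_line_count(path, count_fn):
    """Return the line count for path, reusing a sidecar JSON file
    as long as the data file's size and mtime are unchanged."""
    meta_path = path + '.meta.json'  # hypothetical naming convention
    stat = os.stat(path)
    signature = {'size': stat.st_size, 'mtime_ns': stat.st_mtime_ns}
    try:
        with open(meta_path) as f:
            meta = json.load(f)
        if meta['signature'] == signature:
            return meta['row_count']  # data file unchanged: reuse the count
    except (OSError, ValueError, KeyError, TypeError):
        pass  # missing or unreadable metadata: fall through and recount
    row_count = count_fn(path)
    with open(meta_path, 'w') as f:
        json.dump({'signature': signature, 'row_count': row_count}, f)
    return row_count
```

The first call pays the full counting cost; later calls only stat the data file and read the small JSON file, until the data file changes.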
Upvotes: 0