Reputation: 670

Count lines in very large file where line size is fixed

I have a very large CSV file (6.2 GB) and want to count how many lines it has using Python. This is what I currently have:

import time

file_name = 'TickStory/EURUSD.csv'    
start = time.time()

with open(file_name) as f:
    line_count = sum(1 for line in f)

print(line_count)

end = time.time()
print(end - start)

Every column in the file has a fixed number of characters. The content of the file is as follows:

Timestamp,Bid price
2012-01-01 22:00:36.416,1.29368
2012-01-01 22:00:40.548,1.29366
2012-01-01 22:01:48.884,1.29365
2012-01-01 22:01:53.775,1.29365
2012-01-01 22:01:54.594,1.29366
2012-01-01 22:01:55.390,1.29367
2012-01-01 22:02:40.765,1.29368
2012-01-01 22:02:41.027,1.29368
...
...

My current code takes around 49.99 seconds. Is there any way to make it faster?

Thanks in advance.

N.B.: There are a lot of available solutions for counting lines cheaply in Python. However, my situation is different from the others because in my file all the lines have a fixed number of characters (except the header line). Is there any way to use that to my advantage?

Upvotes: 1

Answers (3)

wjandrea

Reputation: 33159

Since each row has a fixed number of characters, just get the file's size in bytes with os.path.getsize, subtract the length of the header, then divide by the length of each row. Something like this:

import os

file_name = 'TickStory/EURUSD.csv'

len_head = len('Timestamp,Bid price\n')
len_row = len('2012-01-01 22:00:36.416,1.29368\n')

size = os.path.getsize(file_name)

# Integer division gives the number of data rows; + 1 counts the header line
print((size - len_head) // len_row + 1)

This assumes all characters in the file are 1 byte, every line ends with a single \n, and the file ends with a trailing newline.
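
If you would rather not hard-code the two lengths, here is a minimal sketch that measures them from the file itself. The function name count_fixed_width_lines is my own, and it assumes the first data row is representative of all the others. Reading in binary mode means the lengths are in bytes and include whatever line ending the file really uses.

```python
import os

def count_fixed_width_lines(path):
    """Count lines in a file whose data rows all have the same byte length."""
    size = os.path.getsize(path)
    # Binary mode: lengths are in bytes and include the real line ending (\n or \r\n).
    with open(path, "rb") as f:
        len_head = len(f.readline())  # header line, may differ in length
        len_row = len(f.readline())   # first data row, assumed representative
    if len_row == 0:
        # Header only (or an empty file): there are no data rows to divide by.
        return 1 if len_head else 0
    rows, remainder = divmod(size - len_head, len_row)
    # A remainder means the last row lacks a trailing newline
    # (or the rows are not really fixed width).
    if remainder:
        rows += 1
    return rows + 1  # + 1 for the header line
```

Only two lines are read, so the cost does not depend on the file size. If the rows turn out not to be fixed width, the result will be wrong without any warning, so it is worth checking once against a full count.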

Upvotes: 3

Iain Shelvington

Reputation: 32294

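Try running wc on your Ubuntu machine:

import subprocess

file_name = 'TickStory/EURUSD.csv'
result = subprocess.run(['wc', '-l', file_name], capture_output=True, text=True)
# wc prints "<count> <path>"; the first field is the line count
print(int(result.stdout.split()[0]))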

Upvotes: 0

Axxelerated

Reputation: 161

Frankly, the time might not change much, as you still have to load the whole file into memory. You can try this, since you don't have to iterate through the file yourself and Python does it for you:

import csv

with open('TickStory/EURUSD.csv',"r") as f:
    reader = csv.reader(f,delimiter = ",")
    data = list(reader)
    row_count = len(data)
    print(row_count)

In such a case I would suggest maintaining an additional file containing metadata about this file, such as row_count and other details, and taking care to update the metadata whenever the file is updated.
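
A rough sketch of what such a metadata file could look like, assuming a JSON sidecar stored next to the CSV. The ".meta.json" suffix and the names cached_line_count and slow_count are made up for illustration. It stores the file's size and modification time alongside the count, so a stale count is detected and recomputed rather than trusted.

```python
import json
import os

def cached_line_count(path, count_fn):
    """Return the line count for path, recounting only when the file has changed."""
    meta_path = path + ".meta.json"  # hypothetical sidecar file name
    stat = os.stat(path)
    fingerprint = {"size": stat.st_size, "mtime_ns": stat.st_mtime_ns}
    try:
        with open(meta_path) as f:
            meta = json.load(f)
        if meta.get("fingerprint") == fingerprint:
            return meta["row_count"]  # file unchanged since the last count
    except (FileNotFoundError, json.JSONDecodeError, KeyError):
        pass  # no usable metadata yet; fall through and recount
    row_count = count_fn(path)
    with open(meta_path, "w") as f:
        json.dump({"fingerprint": fingerprint, "row_count": row_count}, f)
    return row_count

def slow_count(path):
    """Full scan, used only when the cached count is missing or stale."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)
```

Calling cached_line_count('TickStory/EURUSD.csv', slow_count) pays for the full scan once, and later calls only read the small JSON file until the CSV changes.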

Upvotes: 0
