Reputation: 15953
In the following script, is there a way to find out how many "chunks" there are in total?
import pandas as pd
import numpy as np

data = pd.read_csv('data.txt', delimiter=',', chunksize=50000)
for chunk in data:
    print(chunk)
Using len(chunk)
will only give me how many rows each one has.
Is there a way to do it without adding the iteration manually?
Upvotes: 10
Views: 10136
Reputation: 191
Ami's answer may work, but I have found that the fastest way is to use the wc -l command on Linux. A Python implementation would be as follows:
import subprocess

def count_lines(filename):
    try:
        # Run 'wc -l' and capture its output
        result = subprocess.run(['wc', '-l', filename], stdout=subprocess.PIPE, text=True)
        # The line count is the first field of the command output
        return int(result.stdout.split()[0])
    except Exception as e:
        print(f"An error occurred: {e}")
        return None
Dividing the data row count (the line count minus the header row) by the chunksize and rounding up gives exactly the number of chunks. This is because the chunksize argument of pd.read_csv is the number of rows per chunk.
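The arithmetic above can be sketched as a small helper. This is a hypothetical example, not part of the answer's code; the assumption of exactly one header row is noted in the comments:

```python
import math

def count_chunks(line_count, chunksize, has_header=True):
    # Hypothetical helper: pandas consumes the header row once and does not
    # include it in the chunks, so subtract it before dividing
    # (assumption: the file has exactly one header row).
    rows = line_count - 1 if has_header else line_count
    return math.ceil(rows / chunksize)

print(count_chunks(100001, 50000))  # 100000 data rows -> 2 chunks
print(count_chunks(120001, 50000))  # 120000 data rows -> 3 chunks
```

Note that ceiling division is what makes this exact: naive floor division plus one over-counts by one whenever the row count is an exact multiple of the chunksize.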
Upvotes: 0
Reputation: 76336
CSV, being row-based, does not let a process know how many lines the file contains until the whole file has been scanned.
Very minimal scanning is necessary, though, assuming the CSV file is well formed:
sum(1 for row in open('data.txt', 'r'))
This might prove useful in case you need to calculate in advance how many chunks there are; a full CSV reader is overkill for that. The line above has very low memory requirements and does minimal parsing.
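A minimal sketch wrapping that one-liner in a function, so the file handle is closed explicitly rather than left to the garbage collector (the function name is just an illustration):

```python
def count_file_lines(filename):
    # Minimal scan: count lines without parsing any CSV fields.
    # Note the count includes the header line, if the file has one.
    with open(filename, 'r') as f:
        return sum(1 for _ in f)
```

With a header row, `math.ceil((count_file_lines('data.txt') - 1) / 50000)` would then give the number of chunks for the chunksize used in the question.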
Upvotes: 11