Leb

Reputation: 15953

Total number of chunks in pandas

In the following script, is there a way to find out how many "chunks" there are in total?

import pandas as pd
import numpy as np

data = pd.read_csv('data.txt', delimiter = ',', chunksize = 50000) 

for chunk in data:
    print(chunk)

Using len(chunk) only gives me the number of rows in each chunk.

Is there a way to get the total number of chunks without iterating over them manually?

Upvotes: 10

Views: 10136

Answers (2)

Sagar

Reputation: 191

Ami's answer may work, but I have found that the fastest way is to run wc -l on Linux. A Python wrapper would look like this:

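import subprocess

def count_lines(filename):
    """Return the line count reported by 'wc -l', or None on failure."""
    try:
        # Run 'wc -l' and capture its output; check=True raises if wc exits non-zero
        result = subprocess.run(['wc', '-l', filename], stdout=subprocess.PIPE, text=True, check=True)
        # The output looks like "<count> <filename>", so take the first field
        line_count = int(result.stdout.split()[0])
        return line_count
    except (OSError, subprocess.CalledProcessError, ValueError, IndexError) as e:
        print(f"An error occurred: {e}")
        return None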

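To get the number of chunks, subtract the header line from this count, divide by chunksize, and round up. Simply dividing and adding 1 overcounts by one whenever the number of data rows is an exact multiple of chunksize. This works because the chunksize argument of pd.read_csv is the number of rows returned per chunk. Note that wc -l counts newline characters, so a file whose last line has no trailing newline is undercounted by one.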
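The arithmetic can be sketched as below. This is only an illustration: chunk_count and its header_lines parameter are hypothetical names, not part of pandas, and it assumes every line after the header is one data row (no blank lines, no line breaks inside quoted fields) and that there is at least one data row.

```python
def chunk_count(line_count, chunksize, header_lines=1):
    """Estimate how many chunks pd.read_csv(..., chunksize=chunksize) would yield.

    line_count is the total number of lines in the file, header included.
    header_lines is an assumption about the file layout (one header row by default).
    """
    data_rows = max(line_count - header_lines, 0)
    # Ceiling division using integers only: -(-a // b) rounds a / b up
    return -(-data_rows // chunksize)


# 1 header line + 100000 data rows, exactly two full chunks
print(chunk_count(100_001, 50_000))
# One extra data row spills into a third chunk
print(chunk_count(100_002, 50_000))
```

With 100000 data rows, "divide and add 1" would report 3 chunks, while rounding up reports 2.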

Upvotes: 0

Ami Tavory

Reputation: 76336

Because CSV is a row-based text format with no index, a process cannot know how many lines a file contains until it has scanned the whole file.

Only minimal scanning is needed, though, assuming the CSV file is well formed:

with open('data.txt', 'r') as f:
    line_count = sum(1 for row in f)

This can be useful if you need to calculate in advance how many chunks there are. A full CSV reader is overkill for this. The code above has very low memory requirements and does minimal parsing.
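Putting the line count and the chunk arithmetic together, a minimal sketch might look like this. The function name count_chunks and the header_lines parameter are made up for illustration. It assumes rows end with "\n", that the file has one header line, and that it contains no blank lines or quoted fields with embedded line breaks. The file is opened in binary mode here so that no text decoding is needed just to count lines.

```python
import os
import tempfile


def count_chunks(path, chunksize, header_lines=1):
    """Estimate the number of chunks without building any DataFrame."""
    # Iterating a binary file yields one item per b"\n"-terminated line
    with open(path, "rb") as f:
        line_count = sum(1 for _ in f)
    data_rows = max(line_count - header_lines, 0)
    # Round up: a partially filled last chunk still counts as a chunk
    return -(-data_rows // chunksize)


# Small demonstration: a header plus 7 data rows, read 3 rows at a time
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a,b\n")
    for i in range(7):
        tmp.write(f"{i},{i}\n")

print(count_chunks(tmp.name, chunksize=3))  # 3
os.remove(tmp.name)
```

If the estimate matters, it is worth checking it once against an actual loop over pd.read_csv(..., chunksize=...) on a sample file, since options that skip or merge lines change the row count.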

Upvotes: 11
