Reputation: 4357
I've been dealing with a lot of 4-5 GB CSV files at work over the last few days, and so that I know how far they've progressed through reading/writing I wrote a couple of wrapper functions on top of pandas' methods. It all seems to work very well; there's a bit of overhead, but the convenience outweighs most issues.
At the same time, when reading a CSV, the progress bar can only display the correct percentage if I know the number of rows in advance, since that determines how many chunks there will be. The simplest solution I came up with is to load just the 0th column of the CSV before reading the rest and get its size. But this takes a noticeable amount of time when files are millions of rows long.
Also, reading a single column takes an unreasonably high proportion of the total time: reading one column of a CSV with 125 columns and a few million rows took ~24 seconds, while reading the whole file takes 63 seconds.
And this is a function I've been using to read csvs:
import sys

import pandas as pd
from tqdm import tqdm

def read_csv_with_progressbar(filename: str,
                              chunksize: int = 50000) -> pd.DataFrame:
    # Read only the first column to learn the row count in advance.
    length = pd.read_csv(filename, usecols=[0]).shape[0]
    total = -(-length // chunksize)  # ceiling division so the bar reaches 100%
    chunk_list = []
    chunks = pd.read_csv(filename, chunksize=chunksize)
    with tqdm(total=total, file=sys.stdout) as pbar:
        for chunk in chunks:
            chunk_list.append(chunk)
            pbar.set_description('Reading source csv file')
            pbar.update(1)
    return pd.concat(chunk_list, axis=0)
Is there any way to get the number of rows in a CSV faster than my flawed method?
Upvotes: 3
Views: 7130
Reputation: 31
I was dealing with the same problem, but the solutions proposed didn't work for me: with CSV files over 20 GB in size the processing time was still too large. After consulting a co-worker I found an almost instant solution using subprocess. It goes like:
import subprocess
num_lines = int(subprocess.check_output("wc -l test.csv", shell=True).split()[0]) - 1
subprocess.check_output returns the line count (including the header) followed by the path to the file; split() isolates the count as a string; int converts it to an integer; and finally we subtract 1 to account for the header.
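If you'd rather avoid shell=True (which spawns a shell and is risky with untrusted filenames), the same call works with an argument list. This is a minimal sketch; count_rows is a hypothetical helper name, and it still requires a Unix-like system with wc available:

```python
import subprocess

def count_rows(path: str) -> int:
    """Count data rows in a CSV via `wc -l`, assuming one header line."""
    # Passing a list instead of a string avoids shell interpretation.
    out = subprocess.check_output(["wc", "-l", path])
    return int(out.split()[0]) - 1  # subtract 1 for the header row
```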
Upvotes: 3
Reputation: 117926
Assuming there are no quoted strings (with newlines in them) or other shenanigans in your CSV file, an accurate (but hacky) solution is to not even parse the file, but simply count the number of newlines in it:
import numpy as np

chunk = 1024 * 1024  # Process 1 MB at a time.
f = np.memmap("test.csv", dtype=np.uint8, mode="r")
num_newlines = sum(int(np.sum(f[i:i + chunk] == ord('\n')))
                   for i in range(0, len(f), chunk))
del f
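To wire this newline count into a chunked read with a progress bar, you can subtract one for the header and use ceiling division for the chunk total. A sketch under those assumptions (a single header line and every row, including the last, ending in a newline):

```python
import sys

import numpy as np
import pandas as pd
from tqdm import tqdm

def count_newlines(filename: str, chunk: int = 1024 * 1024) -> int:
    # Memory-map the file as raw bytes and count b'\n' in 1 MB slices.
    f = np.memmap(filename, dtype=np.uint8, mode="r")
    n = sum(int(np.sum(f[i:i + chunk] == ord('\n')))
            for i in range(0, len(f), chunk))
    del f
    return n

def read_csv_with_progressbar(filename: str,
                              chunksize: int = 50000) -> pd.DataFrame:
    rows = count_newlines(filename) - 1  # drop the header line
    total = -(-rows // chunksize)        # ceiling division
    chunk_list = []
    with tqdm(total=total, file=sys.stdout,
              desc='Reading source csv file') as pbar:
        for chunk in pd.read_csv(filename, chunksize=chunksize):
            chunk_list.append(chunk)
            pbar.update(1)
    return pd.concat(chunk_list, axis=0)
```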
Upvotes: 5