Reputation: 851
I have some relatively large .mat files that I'm reading into Python to eventually use in PyTorch. The files range in the number of rows (~55k to ~111k), but each has a little under 11k columns, with no header, and all entries are floats. The file sizes range from 5.8 GB to 11.8 GB. The .mat files came from a prior data processing step in Perl, so I'm not sure about the mat version; when I tried to load a file using scipy.io.loadmat, I received the following error: ValueError: Unknown mat file type, version 46, 56. I've tried pandas, dask, and astropy and been successful, but it takes between 4 and 6 minutes to load a single file. Here's the code for loading with each of these methods, run as a timing experiment:
import pandas as pd
import dask.dataframe as dd
from astropy.io import ascii as aio
import numpy as np
import time

# dataPath: path to the tab-separated data file (defined elsewhere)
numberIterations = 6
daskTime = np.zeros((numberIterations,), dtype=float)
pandasTime = np.zeros((numberIterations,), dtype=float)
astropyTime = np.zeros((numberIterations,), dtype=float)

for ii in range(numberIterations):
    # dask
    t0 = time.time()
    data = dd.read_csv(dataPath, delimiter='\t', dtype=np.float64, header=None)
    daskTime[ii] = time.time() - t0
    data = 0
    del data

    # pandas
    t0 = time.time()
    data = pd.read_csv(dataPath, delimiter='\t', dtype=np.float64, header=None)
    pandasTime[ii] = time.time() - t0
    data = 0
    del data

    # astropy
    t0 = time.time()
    data = aio.read(dataPath, format='fast_no_header', delimiter='\t',
                    header_start=None, guess=False)
    astropyTime[ii] = time.time() - t0
    data = 0
    del data
When I time these methods, dask is by far the slowest (by almost 3x), followed by pandas, and then astropy. For the largest file, the load time (in seconds) over 6 runs is:
dask: 1006.15 (avg), 1.14 (std)
pandas: 337.50 (avg), 5.84 (std)
astropy: 314.61 (avg), 2.02 (std)
I'm wondering if there is a faster way to load these files, since this is still quite long. Specifically, I'm wondering whether there is a better library for consistently loading tabular float data, and/or whether there is a way to incorporate C/C++ or bash to read the files faster. I realize this question is a little open-ended; I'm hoping to get some ideas for reading these files in faster, so that less time is wasted just on reading the files.
Upvotes: 0
Views: 1522
Reputation: 11
Pandas is not as fast as other formats, and it is very slow for large datasets; I don't know what the other answer is talking about.
I went from 3+ hours to 3 minutes just by switching from Pandas to NumPy binary files.
The easiest implementation for you would be to use NumPy binaries, with Numba for anything computation-heavy. Numba JIT-compiles numerical Python code such as for loops down to fast machine code.
I've tried many different formats, and reading into memory or using a memory map is still faster than using something like Dask for most datasets, although my datasets are less than 10 GB and I have memory to spare. A sketch of the NumPy-binary approach is below.
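A minimal sketch of that workflow, assuming the tab-separated file from the question (dataPath and 'data.npy' are placeholder paths): parse the text once with pandas, save the array as a .npy binary, then reload it (optionally memory-mapped) on later runs.

import numpy as np
import pandas as pd

# One-time conversion: parse the tab-separated text once, save as a NumPy binary.
# dataPath and 'data.npy' are placeholder paths.
data = pd.read_csv(dataPath, delimiter='\t', dtype=np.float64, header=None).to_numpy()
np.save('data.npy', data)

# Later runs skip text parsing entirely; mmap_mode='r' memory-maps the file
# on disk instead of reading it all into RAM up front.
data = np.load('data.npy', mmap_mode='r')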
Upvotes: 1
Reputation: 808
Given these were generated in Perl, and given the code above works, these are tab-separated text files, not MATLAB files, which is what scipy.io.loadmat would expect.
Generally, reading in text is slow, and will depend heavily on compression and IO limitations.
FWIW pandas is already pretty well optimised under the hood, and I doubt you would get significant gains from using C directly.
If you plan to use these files frequently, it might be worth using zarr or hdf5 to represent the tabular float data. I'd lean towards zarr if you have some experience with dask already; they work nicely together.
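A minimal sketch of the one-time conversion with dask and zarr, assuming the question's tab-separated file (dataPath and 'data.zarr' are placeholder paths) and that the zarr package is installed:

import numpy as np
import dask.dataframe as dd
import dask.array as da

# One-time conversion from the tab-separated text file to a chunked zarr store.
# dataPath and 'data.zarr' are placeholder paths.
ddf = dd.read_csv(dataPath, delimiter='\t', dtype=np.float64, header=None)
arr = ddf.to_dask_array(lengths=True)  # compute chunk lengths so the array shape is known
arr.to_zarr('data.zarr')

# Later loads come straight from the binary store, with no text parsing.
data = da.from_zarr('data.zarr')       # lazy dask array
data = data.compute()                  # materialize as a NumPy array if it fits in memory

After the conversion, repeat loads are limited mostly by disk throughput rather than text parsing.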
Upvotes: 3