lauhoman

Reputation: 63

Why is polars.scan_csv even faster than disk read speed?

I am testing Polars performance with the LazyFrame API, using polars.scan_csv with a filter. The performance is much better than I expected. Filtering a CSV file is even faster than the disk's read speed! WHY???

The CSV file is about 1.51 GB, stored on my PC's HDD.

Testing code:

import time
import polars as pl

t0 = time.time()
lazy_df = pl.scan_csv("kline.csv")  # lazy scan; data is only read at .collect()
df = lazy_df.filter(pl.col('ts') == '2015-01-01').collect().to_pandas()
print(time.time() - t0)

> Output: 1.8616907596588135

It takes less than 2 seconds to scan the whole CSV file, which works out to a scan speed of about 1.51 GB / 1.86 s ≈ 810 MB/s. That is apparently much faster than an HDD can read.

Upvotes: 6

Views: 3703

Answers (1)

user18559875

What you're probably seeing is a common pitfall in benchmarking: the caching of files by your operating system. Most modern operating systems cache recently accessed files in RAM, as long as free memory permits.

The first time you accessed the file, your operating system likely cached the 1.51 GB file in RAM (possibly even when you created the file). As such, subsequent retrievals are not really accessing your HDD -- they are running against the cached file in RAM, a process which is far faster than reading from your HDD. (Which is kind of the point of caching files in RAM.)
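You can reproduce this effect without Polars at all. Here is a minimal sketch, assuming the kline.csv file from the question sits in the working directory and is not yet cached; it times two consecutive raw reads of the same file:

import time

def timed_read(path):
    """Read the whole file in 64 MiB chunks and return the elapsed seconds."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(64 * 1024 * 1024):
            pass
    return time.perf_counter() - start

# The first (cold) read is limited by the disk; the second (warm) read is
# typically served from the OS page cache and finishes far faster.
print(f"cold read: {timed_read('kline.csv'):.2f} s")
print(f"warm read: {timed_read('kline.csv'):.2f} s")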

An Example

As an example, I created a 29.9 GB csv file, and purposely placed it on my NAS (network-attached storage) rather than on my local hard drive. For reference, my NAS and my machine are connected by a 10 gigabit/sec network.

Running this benchmarking code the first time took about 54 seconds.

import polars as pl
import time
start = time.perf_counter()
(
    pl.scan_csv('/mnt/bak-projects/StackOverflow/benchmark.csv')
    .filter(pl.col('col_0') == 100)
    .collect()
)
print(time.perf_counter() - start)
Output:

shape: (1, 27)
┌─────┬───────┬───────┬───────┬─────┬────────┬────────┬────────┬────────┐
│ id  ┆ col_0 ┆ col_1 ┆ col_2 ┆ ... ┆ col_22 ┆ col_23 ┆ col_24 ┆ col_25 │
│ --- ┆ ---   ┆ ---   ┆ ---   ┆     ┆ ---    ┆ ---    ┆ ---    ┆ ---    │
│ f64 ┆ i64   ┆ i64   ┆ i64   ┆     ┆ i64    ┆ i64    ┆ i64    ┆ i64    │
╞═════╪═══════╪═══════╪═══════╪═════╪════════╪════════╪════════╪════════╡
│ 1.0 ┆ 100   ┆ 100   ┆ 100   ┆ ... ┆ 100    ┆ 100    ┆ 100    ┆ 100    │
└─────┴───────┴───────┴───────┴─────┴────────┴────────┴────────┴────────┘
>>> print(time.perf_counter() - start)
53.92608916899917

So, reading a 29.9 GB file in 54 seconds is roughly 29.9 GB * (8 bits-per-byte) / 54 seconds = 4.4 gigabits per second. Not bad for retrieving files from a network drive. And certainly within the realm of possibility on my 10 gigabit/sec network.

However, the file is now cached by my operating system (Linux) in RAM (I have 512 GB of RAM). So when I ran the same benchmarking code a second time, it took a mere 3.5 seconds:

shape: (1, 27)
┌─────┬───────┬───────┬───────┬─────┬────────┬────────┬────────┬────────┐
│ id  ┆ col_0 ┆ col_1 ┆ col_2 ┆ ... ┆ col_22 ┆ col_23 ┆ col_24 ┆ col_25 │
│ --- ┆ ---   ┆ ---   ┆ ---   ┆     ┆ ---    ┆ ---    ┆ ---    ┆ ---    │
│ f64 ┆ i64   ┆ i64   ┆ i64   ┆     ┆ i64    ┆ i64    ┆ i64    ┆ i64    │
╞═════╪═══════╪═══════╪═══════╪═════╪════════╪════════╪════════╪════════╡
│ 1.0 ┆ 100   ┆ 100   ┆ 100   ┆ ... ┆ 100    ┆ 100    ┆ 100    ┆ 100    │
└─────┴───────┴───────┴───────┴─────┴────────┴────────┴────────┴────────┘
>>> print(time.perf_counter() - start)
3.5459880090020306

If my 29.9 GB file had really been pulled across my network, this would imply a network speed of at least 29.9 GB * 8 / 3.5 s ≈ 68 gigabits per second. (Clearly, not possible on my 10 gigabit/sec network.)

And a third time: 2.9 seconds

shape: (1, 27)
┌─────┬───────┬───────┬───────┬─────┬────────┬────────┬────────┬────────┐
│ id  ┆ col_0 ┆ col_1 ┆ col_2 ┆ ... ┆ col_22 ┆ col_23 ┆ col_24 ┆ col_25 │
│ --- ┆ ---   ┆ ---   ┆ ---   ┆     ┆ ---    ┆ ---    ┆ ---    ┆ ---    │
│ f64 ┆ i64   ┆ i64   ┆ i64   ┆     ┆ i64    ┆ i64    ┆ i64    ┆ i64    │
╞═════╪═══════╪═══════╪═══════╪═════╪════════╪════════╪════════╪════════╡
│ 1.0 ┆ 100   ┆ 100   ┆ 100   ┆ ... ┆ 100    ┆ 100    ┆ 100    ┆ 100    │
└─────┴───────┴───────┴───────┴─────┴────────┴────────┴────────┴────────┘
>>> print(time.perf_counter() - start)
2.8593162479992316

Depending on your operating system, there is a way to flush cached files from RAM before benchmarking.
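On Linux, for example, the page cache can be dropped by writing to /proc/sys/vm/drop_caches (root is required; the path and its behavior are Linux-specific). A minimal sketch:

import os

# Flush dirty pages to disk, then ask the kernel to drop its caches so the
# next benchmark run actually reads from the disk instead of RAM.
os.sync()
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3")  # 1 = page cache, 2 = dentries/inodes, 3 = both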

Upvotes: 14
