Micro
Micro

Reputation: 1

Why io-throughput of `select *` is faster than `select one colmun` in local disk using parquet-arrow

I have two sql using parquet-arrow:

`table` has 50 column
sql1 = `select * from table`, total_data_size = 45GB
sql2 = `select value from table`, total_data_size = 30GB

I add profile for io-throughput(Yeah, drop page-cache and just watch disk-io).

I found:

Parquet on HDFS: sql2 is faster than sql1, about 1.5 times which is reasonable

Parquet on local-disk(1MB randread=130MB;1MB read=250MB): sq1 is faster than sql2, about 4 times which is confusing.

I guess two reasons via iostat:

  1. the io-load is high(about 100~130MB/S, utils=90%~100%) when execute sql2, which seem mean the select one column is more rand read and make the io-throughput decrease
  2. select * will cache more page-cache and the hit-ratio is high in process though I drop page-cache before executing. so for the select *, the io-throughput actually is benefit from cache hit ratio.

Expect your help, thanks!

Upvotes: 0

Views: 77

Answers (1)

Micro
Micro

Reputation: 1

I use cachestat to get the page-cache hit-ratio, and I found select * has higher ratio(50%) than select one column(27%), so the io-throughput of select * is more better because of the page-cache

I try open with O_DIRECT to read the parquet to make sure the conclusion, but it report errno: 22, strerror: Invalid argument, I haven't found the error root cause, but I think the page-cache hit-ratio is the root cause for io-throughput.

However, why select * has higher hit-ratio?

Upvotes: 0

Related Questions