Reputation: 1
I have two sql using parquet-arrow:
`table` has 50 column
sql1 = `select * from table`, total_data_size = 45GB
sql2 = `select value from table`, total_data_size = 30GB
I add profile for io-throughput(Yeah, drop page-cache and just watch disk-io).
I found:
Parquet on HDFS: sql2 is faster than sql1, about 1.5 times which is reasonable
Parquet on local-disk(1MB randread=130MB;1MB read=250MB): sq1 is faster than sql2, about 4 times which is confusing.
I guess two reasons via iostat
:
select one column
is more rand read
and make the io-throughput decreaseselect *
will cache more page-cache and the hit-ratio is high in process though I drop page-cache before executing. so for the select *
, the io-throughput actually is benefit from cache hit ratio.Expect your help, thanks!
Upvotes: 0
Views: 77
Reputation: 1
I use cachestat to get the page-cache hit-ratio, and I found select *
has higher ratio(50%) than select one column
(27%), so the io-throughput of select *
is more better because of the page-cache
I try open with O_DIRECT
to read the parquet to make sure the conclusion, but it report errno: 22, strerror: Invalid argument
, I haven't found the error root cause, but I think the page-cache hit-ratio is the root cause for io-throughput.
However, why select *
has higher hit-ratio?
Upvotes: 0