How to specify which columns to load in pyarrow.dataset

Question

I am trying to load only the columns I want, the way we do in pandas.

import pandas as pd

use_cols = ["ArrDelay", "DepDelay"]
df = pd.read_csv(path, usecols=use_cols)
df

Is there a similar option in Arrow?

import pyarrow.dataset as ds

dataset = ds.dataset(path, format="csv")

Pace · Accepted Answer

I'm guessing what you want is...

table = dataset.to_table(columns=["ArrDelay", "DepDelay"])

The dataset methods scan(), to_batches(), and to_table() all take the same arguments, which are documented on the scan() method (replaced by scanner() in newer pyarrow releases).
