Reputation: 1435
I have multiple .csv files that I am trying to read with arrow::open_dataset(), but it throws an error due to column type inconsistency.
I found this question, which is closely related to my problem, but I am trying a slightly different approach.
I want to use arrow's type autodetection on one sample CSV file, since working out the types of all the columns by hand is time-consuming.
Then I take the detected schema and correct the few columns that cause problems, and finally I use the updated schema to read all the files.
Below is my approach:
library(arrow)

data = read_csv_arrow('data.csv.gz', as_data_frame = FALSE) # has more than 30 columns
sch = data$schema
print(sch)
Schema
trade_id: int64
secid: int64
side: int64
...
nonstd: int64
flags: string
I would like to change the trade_id column type from int64 to string and leave the other columns unchanged.
How can I update the schema?
I'm using R arrow, but I guess answers using pyarrow could also apply.
Upvotes: 1
Views: 1139
Reputation: 930
There are a couple of different ways to do this; you could either extract the code for the schema and manually update it yourself, or you could save the schema as a variable and update it programmatically.
library(arrow)
# set up an arrow table
cars_table <- arrow_table(mtcars)
# view the schema
sch <- cars_table$schema
# print the code that makes up the schema - you could now copy this and edit it
sch$code()
#> schema(mpg = float64(), cyl = float64(), disp = float64(), hp = float64(),
#> drat = float64(), wt = float64(), qsec = float64(), vs = float64(),
#> am = float64(), gear = float64(), carb = float64())
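# the manual route is then to copy that printed schema() call, edit the
# field(s) you need, and use the result directly - edited by hand here just
# to illustrate (new_sch is only an example name):
new_sch <- schema(mpg = float64(), cyl = int32(), disp = float64(), hp = float64(),
                  drat = float64(), wt = float64(), qsec = float64(), vs = float64(),
                  am = float64(), gear = float64(), carb = float64())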
# look at an individual element in the schema
sch[[2]]
#> Field
#> cyl: double
# update this element
sch[[2]] <- Field$create("cylinders", int32())
sch[[2]]
#> Field
#> cylinders: int32
sch$code()
#> schema(mpg = float64(), cylinders = int32(), disp = float64(), hp = float64(),
#> drat = float64(), wt = float64(), qsec = float64(), vs = float64(),
#> am = float64(), gear = float64(), carb = float64())
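To tie this back to your case: once the schema has been updated, you should be able to pass it to open_dataset() to read all of your CSV files. A rough sketch, untested against your data - the directory path and variable names are placeholders, and skip = 1 is there because when you supply a schema the header row would otherwise be read as data:
library(arrow)
# get the auto-detected schema from one sample file
sample_tbl <- read_csv_arrow('data.csv.gz', as_data_frame = FALSE)
sch <- sample_tbl$schema
# replace the trade_id field with a string field (utf8() is arrow's string type),
# keeping its position in the schema
idx <- match("trade_id", sch$names)
sch[[idx]] <- Field$create("trade_id", utf8())
# read the whole dataset with the corrected schema
ds <- open_dataset("path/to/csv/dir", format = "csv", schema = sch, skip = 1)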
Upvotes: 5