LucasMation

Reputation: 2489

R, arrow: How to read .fwf format in R using arrow

I have data stored in several fixed-width files (perhaps not fitting in memory). It would be great to be able to read them using arrow.

p <- 'path_to_my_files'
# p contains: a.txt, b.txt, c.txt,.... all fwf
dic <- fread('fwf_dictionary.csv') # two columns, varname and length, giving each variable's name and width (and hence position) in the fwf files

Is there a read_fwf_arrow(path, dic) function? I imagine that a combination of read_delim_arrow() (with a delimiter that never occurs in the data) and dplyr parsing for each column could do the job, but I don't know how to loop through the variables in dic to extract each one.

read_fwf_arrow <- function(path, dic){
    path %>% read_delim_arrow(delim='#') %>%  # '#' never occurs, so each record is one string column
    loop_extracting_cols_that_can_run_on_arrow()
}
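For concreteness, this untested sketch is roughly what I have in mind. I believe recent arrow releases can translate substr() inside mutate() so the slicing runs on the Arrow side, but the single-column read (col_names = "line") and the start/end arithmetic are just my guesses:

library(arrow)
library(dplyr)
library(rlang)

## Untested sketch: read each record as one string column, then carve
## out one column per row of `dic` with substr(), built up in a loop.
read_fwf_arrow <- function(path, dic) {
  tbl <- read_delim_arrow(
    path,
    delim = "#",            # a character that never occurs in the data
    col_names = "line",     # the whole record lands in one column
    as_data_frame = FALSE   # keep it as an Arrow Table
  )
  ends   <- cumsum(dic$length)
  starts <- c(1L, head(ends, -1) + 1L)
  exprs  <- set_names(
    lapply(seq_len(nrow(dic)), function(i) {
      expr(substr(line, !!starts[i], !!ends[i]))
    }),
    dic$varname
  )
  tbl %>% mutate(!!!exprs) %>% select(-line)
}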

Upvotes: 0

Views: 274

Answers (1)

boshek

Reputation: 4416

Right now there is no fixed-width reading in arrow. However, we can make use of some readr functionality and the way Datasets work. The process below involves creating intermediate files, so it does not read directly from the fixed-width files. Sometimes that's OK, sometimes not. Hopefully it is useful to you, though:

library(dplyr)
library(readr)
library(gdata) ## easiest way to create fixed width file
library(arrow)


## make a "big" fixed width dataset
write.fwf(x = quakes, file = "data.fwf", colnames = FALSE)

## create a temporary directory where you can write the chunks
tdir <- file.path(tempdir(), "arrow-tmp")
dir.create(tdir, showWarnings = FALSE, recursive = TRUE)

## callback function that reads in a chunk then converts that
## chunk into a parquet file
f <- function(x, pos) {
  d <- read_fwf(paste0(x, "\n"), fwf_widths(
    c(6, 6, 4, 4, 2),
    col_names = names(quakes)
  ))
  arrow::write_parquet(d, file.path(tdir, paste0("temp-", pos, ".parquet")))
}

## actually read in chunks 100 lines at a time
read_lines_chunked(file = "data.fwf", readr::SideEffectChunkCallback$new(f), chunk_size = 100)

## they are there
fs::dir_tree(tdir)
#> /var/folders/6f/c414rjlx2l512x7_80hp126c0000gn/T//RtmpHsOQmC/arrow-tmp
#> ├── temp-1.parquet
#> ├── temp-101.parquet
#> ├── temp-201.parquet
#> ├── temp-301.parquet
#> ├── temp-401.parquet
#> ├── temp-501.parquet
#> ├── temp-601.parquet
#> ├── temp-701.parquet
#> ├── temp-801.parquet
#> └── temp-901.parquet

## create a directory where you want your finished data to exist
tdir_real <- file.path(tempdir(), "arrow-real")

## open Dataset with chunked data then write it out again as parquet partitioned by station
arrow::open_dataset(tdir) %>%
  group_by(stations) %>%
  arrow::write_dataset(tdir_real)

fs::dir_tree(tdir_real)
#> /var/folders/6f/c414rjlx2l512x7_80hp126c0000gn/T//RtmpHsOQmC/arrow-real
#> ├── stations=0
#> │   └── part-0.parquet
#> ├── stations=1
#> │   └── part-0.parquet
#> ├── stations=2
#> │   └── part-0.parquet
#> ├── stations=3
#> │   └── part-0.parquet
#> ├── stations=4
#> │   └── part-0.parquet
#> ├── stations=5
#> │   └── part-0.parquet
#> ├── stations=6
#> │   └── part-0.parquet
#> ├── stations=7
#> │   └── part-0.parquet
#> ├── stations=8
#> │   └── part-0.parquet
#> └── stations=9
#>     └── part-0.parquet
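From there the partitioned dataset can be queried lazily, only materializing the rows you ask for (the filter below is just for illustration):

## query the partitioned data without reading it all into memory
arrow::open_dataset(tdir_real) %>%
  filter(mag > 5) %>%
  select(lat, long, mag, stations) %>%
  collect()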

Upvotes: 2
