bioinfornatics

Reputation: 1808

PyArrow: why and when should I use a stream buffer writer?

I need to read a huge volume of data from a custom binary file (using multiprocessing and random access) in order to perform computation and store the results to a parquet dataset. I know the number of columns, but I do not know the number of items to store.

Thus, the file is cut into N logical parts and each part is processed by a chunk reader function. So inside this function, should I:

  1. Create x columns using the Python type list (dynamic allocation through append) and, at the end of the chunk, create one record batch?
  2. Same as (1), but create a record batch every y records. Here we control the number of items per batch. The process could be slower as we have to evaluate a condition (if no_items == y: …) at each iteration.
  3. Start as (1) and, at the end, store the record batch into a RecordBatchStreamWriter in order to return a buffer.
  4. Start as (2) and, at the end, store each record batch into a RecordBatchStreamWriter in order to return a buffer.
  5. Other …

I have read the ipc, memory, and arrow-streaming-columnar documentation, but it is still unclear to me how to process a huge random-access file.

Thanks for your insight

Upvotes: 1

Views: 2308

Answers (1)

Pace

Reputation: 43787

If your desired output is a parquet dataset and your input format is a custom binary file then the record batch APIs are probably not going to help you unless you are spreading the computation across multiple processes (not threads). You mention multiprocessing so it is possible this is the case but I just want to make sure you didn't mean multithreading.

Also, when you say a parquet dataset I am not sure if you mean one huge parquet file or some number of parquet files in a common directory structure. Typically a "parquet dataset" is multiple parquet files.

There is a lot of flexibility in this task and so how you go about it may depend on how you want to later read the data.

Output

  1. Multiple files. You create multiple parquet files where each parquet file contains all of the columns of your data but only X rows. If you can make the batches partitioned on some column(s) in your data (e.g. measurement date / manufacturing date / device id / experiment criteria) then you will be able to use those as later query parameters very efficiently (e.g. if you later query by measurement date you can figure out which files to load simply from the filenames & directory names). You will want the dataset APIs in this case. You can always partition by row number with this approach too, although you won't have as much advantage over the single file case in that scenario (it will still be easier to work with 3rd party parquet tools that only support whole-file parquet reads).

  2. Single file. You can create a huge single parquet file that is split up into many "row groups". This will allow you to read individual row groups at a later time or to do memory efficient batch processing by processing a single row group at a time. The downside here is that you can only batch by "row number" so if you later wanted to do a lookup by something like a timestamp column then you'd probably have to read/scan all the data (unless you knew some way to map timestamps to row number). If you want this approach you will need to use the ParquetWriter to write individual row groups (see the sketch after this list).
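To make the two output shapes concrete, here is a minimal sketch, assuming a reasonably recent pyarrow; the column names (`device_id`, `value`), the paths, and the `chunks()` generator are hypothetical placeholders standing in for your chunk reader:

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

schema = pa.schema([("device_id", pa.int64()), ("value", pa.float64())])

def chunks():
    # Placeholder for your chunk reader: yields one RecordBatch per logical part.
    for i in range(3):
        yield pa.record_batch(
            [pa.array([i, i, i]), pa.array([0.1, 0.2, 0.3])], schema=schema
        )

# Option 1: multiple files, hive-partitioned on a column (dataset API).
ds.write_dataset(
    chunks(),                 # an iterable of RecordBatches (schema required)
    "out_dataset",
    schema=schema,
    format="parquet",
    partitioning=ds.partitioning(
        pa.schema([("device_id", pa.int64())]), flavor="hive"
    ),
)

# Option 2: one big file, one row group written per batch (ParquetWriter).
with pq.ParquetWriter("out_single.parquet", schema) as writer:
    for batch in chunks():
        writer.write_table(pa.Table.from_batches([batch]))
```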

Input

Before you can write to any of the above you will need to read in your custom format and convert the data into Arrow format. This is likely to be expensive if you do it purely in python but, since it is a custom format, it's unlikely you will be able to use builtin Arrow utilities to read the data. If your custom format has any kind of native library for reading in data then it will likely be faster to use those libraries to read in the data, extract the relevant data buffers, and wrap them with arrow metadata so that arrow can access them. Using python to manipulate the resulting Arrow metadata objects (e.g. tables / chunked arrays / etc.) should be pretty trivial/fast.
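As an illustration of the "extract the buffers and wrap them with Arrow metadata" idea, here is a minimal sketch assuming the native reader hands back something that supports the Python buffer protocol (a NumPy array here; the reader itself is hypothetical):

```python
import numpy as np
import pyarrow as pa

# Pretend the native reader decoded one column of one chunk into a NumPy array
# (anything exposing the buffer protocol would work the same way).
values = np.arange(1_000_000, dtype=np.int64)

buf = pa.py_buffer(values)                # zero-copy view of the raw bytes
arr = pa.Array.from_buffers(              # attach Arrow type metadata, no copy
    pa.int64(), len(values), [None, buf]  # [validity bitmap, data buffer]
)

# From here, building higher-level objects is cheap, pure-metadata work.
table = pa.table({"value": arr})
```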

If you choose to parse the file in python then the cost of that is likely going to be so large that it won't really matter which options 1-4 you choose. Ideally though you will want to be able to read the input file into a chunk that you intend to write. Hopefully it will be easy to do something like read in a fixed size batch of rows (all columns) from your input file.

If you want your output format to be partitioned by one or more columns (e.g. partitioned by test criteria) then it will be more complicated. If there is no good way to know how test criteria map to rows you may either have to read the entire file into memory as a single table, make multiple passes, or have many files (e.g. instead of one file per test criteria you could have 20 per test criteria and just read the input file as 20 batches).

IPC

Since you mentioned multiprocessing and the record batch reader/writers I will talk a little bit about how they might be used. If you decide to read your file across multiple processes (potentially even multiple servers) and you want those processes to communicate then you could use the record batch reader/writers. For example, let's say you have one process that is entirely dedicated to reading in the custom file and parsing it into the Arrow format. That process could then use a record batch writer to send those batches to your next process, which would read them in with a record batch reader. That second process could then, for example, do the work of writing each record batch out to some number of parquet files. The record batch reader/writer allow you to exchange data in the Arrow format for rapid transmission between the processes.
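For completeness, a minimal sketch of that stream writer/reader round trip, assuming the batches are exchanged as an in-memory buffer (in practice the sink/source could be a pipe, socket, or file shared between the processes):

```python
import pyarrow as pa

schema = pa.schema([("x", pa.int64())])

# Producer process: serialize record batches into an Arrow IPC stream.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, schema) as writer:
    for start in (0, 5):
        writer.write_batch(
            pa.record_batch([pa.array(range(start, start + 5))], schema=schema)
        )
buf = sink.getvalue()  # a pa.Buffer you could hand to another process

# Consumer process: read the batches back and, e.g., flush them to parquet.
reader = pa.ipc.open_stream(buf)
for batch in reader:
    ...  # write_table / write_dataset as in the Output section
```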

Summary

It's hard to say exactly what to do without knowing more about the file you are reading and your eventual processing/querying goals. I think the simple answer to your question is that the way you read your input will depend on how you eventually want to create your output.

Upvotes: 3
