Reputation: 773
I am currently working on a program that generates images from math equations (think fractal flames). There are two parts to the program:
1. Create a matrix of color values (C++).
2. Normalize the color values and plot the matrix (Python).
In the first step, I generate a large set of data (~700 million points, each with [R, G, B, A] color values) and put each point into the appropriate bin of a (9000,9000,4) matrix. Then, I write the flattened matrix to a CSV and repeat the process between 5 and 20 times. Each CSV file has 81 million rows and 4 columns of data and follows the format below.
r,g,b,a
3,2,0,4 // bin index: [0,0]
1,1,2,2 // bin index: [0,1]
0,0,1,1 // bin index: [0,2]
...
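Assuming the rows continue in this row-major pattern, recovering the 2D bin index from a CSV row number is just division and modulo. A small illustrative snippet (bin_index and WIDTH are just names for this example, not part of my code):

WIDTH = 9000  # bins per row in the (9000,9000,4) matrix

def bin_index(csv_row_number):
    # 0-based CSV row -> (row, column) bin index, row-major layout
    return (csv_row_number // WIDTH, csv_row_number % WIDTH)

print(bin_index(0))     # (0, 0)
print(bin_index(9001))  # (1, 1)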
NOTE: I am running part 1 on a different machine and do not have sudo access, so I do not think there is a way to write to a single parquet file initially.
In order to visualize the data, I need to sum the values from each bin across all files and store them in a single data structure, e.g. a NumPy array of shape (9000, 9000, 4). I would also like to save this data structure to a new file, ideally a Parquet file. This would make it easier to access the same data later and would let me delete the handful of CSV files taking up all that space.
import pyarrow as pa
import numpy as np
from pyarrow import csv, parquet
###################### CREATE PARQUET FILES FROM CSV
for i in range(2,7):  ## convert files 2-6; file 1 is read directly from its CSV below
    print("Reading file",i)
    table = csv.read_csv(r"C:\Users\Path\to\file{}.csv".format(i))
    print("Writing file",i)
    parquet.write_table(table,r"C:\Users\Path\to\newFile{}.parquet".format(i))
print("All files have been converted to .parquet")
###################### COMBINE PARQUET FILES
## Initializing numpy arrays for each column with the first file
table = csv.read_csv(r"C:\Users\Path\to\file1.csv")
rVals = (table.column('r')).to_numpy()
gVals = (table.column('g')).to_numpy()
bVals = (table.column('b')).to_numpy()
aVals = (table.column('a')).to_numpy()
for i in range(2,7):
    print("Reading file",i)
    table = parquet.read_table(r"C:\Users\Path\to\newFile{}.parquet".format(i))
    rVals += (table.column('r')).to_numpy()
    gVals += (table.column('g')).to_numpy()
    bVals += (table.column('b')).to_numpy()
    aVals += (table.column('a')).to_numpy()
table2 = pa.table([rVals,gVals,bVals,aVals],names=['r','g','b','a'])
print("Writing table to .parquet")
parquet.write_table(table2,r"C:\Users\Path\to\newFileTot.parquet")
This works, but takes a lot of RAM and adds a minute or two to the total runtime. Is there a better way to do this? I am open to any and all suggestions for how to improve the system as I don't have a lot of experience with this sort of file manipulation.
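For reference, the last step I have in mind, turning the summed columns into the (9000,9000,4) array and writing one Parquet file, would look roughly like this (a sketch; the output path and the intermediate names rgba, flat, and out_table are just illustrative):

import numpy as np
import pyarrow as pa
from pyarrow import parquet

# rVals, gVals, bVals, aVals: flat arrays of length 81,000,000 (one entry per bin)
rgba = np.stack([rVals, gVals, bVals, aVals], axis=-1).reshape(9000, 9000, 4)

# For Parquet, flatten back to one row per bin with four columns.
flat = rgba.reshape(-1, 4)
out_table = pa.table({name: flat[:, k] for k, name in enumerate('rgba')})
parquet.write_table(out_table, r"C:\Users\Path\to\combined.parquet")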
Upvotes: 0
Views: 480
Reputation: 2446
You can use the dataset API in PyArrow to point at a directory of CSV (or Parquet) files and convert them into a single Parquet dataset directly.
import pyarrow.dataset as ds

# Point at the directory containing the input files; raw strings avoid escape issues in Windows paths.
dataset = ds.dataset(r"C:\Users\Path\to\input", format="parquet")
ds.write_dataset(dataset, r"C:\Users\Path\to\output", format="parquet")
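If you still have the original CSV files, the same call can point at them directly and skip the per-file conversion step. A sketch with placeholder paths and default CSV parsing options:

import pyarrow.dataset as ds

# Treat the whole directory of CSV files as one dataset and rewrite it as Parquet.
csv_dataset = ds.dataset(r"C:\Users\Path\to\csv_input", format="csv")
ds.write_dataset(csv_dataset, r"C:\Users\Path\to\parquet_output", format="parquet")

Both calls treat the whole directory as a single logical table, so no manual per-file loop is needed.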
I describe this in more detail in a blog post.
Upvotes: 3