trent

Reputation: 773

How to sum cell data from multiple CSV files into a single parquet?

QUESTION:

Is there a more efficient way to sum corresponding data entries from a handful of large CSV files (5-20 files, 600 MB each) into a single Parquet file? All of the CSV files have exactly the same format.


I am currently working on a program that generates images from math equations (think fractal flames). There are two parts to the program:

  1. Create a matrix of color values (C++).

  2. Normalize the color values and plot the matrix (Python).

In the first step, I generate a large set of data (~700 million points, each with [R, G, B, A] color values) and put each point into the appropriate bin of a (9000,9000,4) matrix. Then, I write the flattened matrix to a CSV and repeat the process between 5 and 20 times. Each CSV file has 81 million rows and 4 columns of data and follows the format below.

r,g,b,a
3,2,0,4   // bin index: [0,0] 
1,1,2,2   // bin index: [0,1] 
0,0,1,1   // bin index: [0,2]
...

NOTE: I am running part 1 on a different machine and do not have sudo access, so I do not think there is a way to write to a single parquet file initially.

In order to visualize the data, I need to sum the values of each bin across all files and store them in a single data structure, e.g. a NumPy array of shape (9000, 9000, 4). I would also like to save this data structure to a new file, ideally a Parquet file. This would make it easier to access the same data later and would let me delete the handful of CSV files taking up all that space.
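
To make the mapping concrete: based on the sample above, the rows appear to be in row-major order, so row k of a flattened CSV corresponds to bin [k // 9000, k % 9000], and rebuilding the image array from one file is just a reshape. A minimal sketch (the path is a placeholder in the same style as my code below):

import numpy as np
from pyarrow import csv

N = 9000  # the image is N x N bins with 4 color channels per bin

# Read one flattened CSV and rebuild the (N, N, 4) color matrix;
# row k of the table lands in bin [k // N, k % N].
table = csv.read_csv(r"C:\Users\Path\to\file1.csv")
flat = np.column_stack([table.column(name).to_numpy() for name in ('r', 'g', 'b', 'a')])
image = flat.reshape(N, N, 4)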


Current Code:

import pyarrow as pa
import numpy as np
from pyarrow import csv, parquet

###################### CREATE PARQUET FILES FROM CSV

for i in range(2,7): ## files 2-6 (file 1 is read directly from CSV below)
    print("Reading file",i)
    table = csv.read_csv(r"C:\Users\Path\to\file{}.csv".format(i))
    print("Writing file",i)
    parquet.write_table(table,r"C:\Users\Path\to\newFile{}.parquet".format(i))

print("All files have been converted to .parquet")

###################### COMBINE PARQUET FILES

## Initializing numpy arrays for each column with the first file
table = csv.read_csv(r"C:\Users\Path\to\file1.csv")
rVals = (table.column('r')).to_numpy()
gVals = (table.column('g')).to_numpy()
bVals = (table.column('b')).to_numpy()
aVals = (table.column('a')).to_numpy()

for i in range(2,7):
    print("Reading file",i)
    table = parquet.read_table(r"C:\Users\Path\to\newFile{}.parquet".format(i))
    rVals += (table.column('r')).to_numpy()
    gVals += (table.column('g')).to_numpy()
    bVals += (table.column('b')).to_numpy()
    aVals += (table.column('a')).to_numpy()
    
table2 = pa.table([rVals,gVals,bVals,aVals],names=['r','g','b','a'])
print("Writing table to .parquet")
parquet.write_table(table2,r"C:\Users\Path\to\newFileTot.parquet")

This works, but takes a lot of RAM and adds a minute or two to the total runtime. Is there a better way to do this? I am open to any and all suggestions for improving the system, as I don't have a lot of experience with this sort of file manipulation.

Upvotes: 0

Views: 480

Answers (1)

fmic_

Reputation: 2446

You can use the dataset API in PyArrow to point at a set of CSV (or Parquet) files and convert them into a single Parquet file directly.

import pyarrow.dataset as ds

dataset = ds.dataset(r"C:\Users\Path\to\input", format="parquet")
ds.write_dataset(dataset, r"C:\Users\Path\to\output", format="parquet")
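
If you point the dataset at the original CSVs instead, the same two calls skip the intermediate Parquet conversion step. A rough sketch, assuming the CSV files all sit in one directory (the paths and directory names here are placeholders):

import pyarrow.dataset as ds

# Treat the directory of CSV files as one logical dataset and
# write it back out in Parquet format in a single pass.
csv_dataset = ds.dataset(r"C:\Users\Path\to\csv_input", format="csv")
ds.write_dataset(csv_dataset, r"C:\Users\Path\to\parquet_output", format="parquet")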

I describe this in more detail in a blog post.

Upvotes: 3
