I have a data table with ~74 million rows that I loaded with Blaze:
from blaze import CSV, data
csv = CSV('train.csv')
t = data(csv)
It has these fields: A, B, C, D, E, F, G.
Since this is such a large dataset, how can I efficiently output the rows that match specific criteria? For example, I want the rows where A==4, B==8, and E==10. Is there a way to parallelize the lookup, for example with threading or multiprocessing?
By parallel programming I mean, for example, that one worker searches rows 1 through 100,000 for matches, a second worker searches rows 100,001 through 200,000, and so on, as in the sketch below.
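Something like this rough pandas sketch is what I have in mind (the chunk size and pool size are made up, and find_matches is a hypothetical helper):

import multiprocessing

import pandas as pd

def find_matches(chunk):
    # Keep only the rows of this chunk that satisfy all three criteria.
    return chunk[(chunk.A == 4) & (chunk.B == 8) & (chunk.E == 10)]

if __name__ == '__main__':
    # Stream the file in 100,000-row chunks and filter each one in a worker.
    chunks = pd.read_csv('train.csv', chunksize=100000)
    pool = multiprocessing.Pool(4)
    result = pd.concat(pool.imap(find_matches, chunks))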
Upvotes: 3
Views: 158
Reputation: 109520
Your selection criteria are quite simple:
t[(t.A == 4) & (t.B == 8) & (t.E == 10)]
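In your case that would look something like this minimal sketch (reusing the CSV setup from your question):

from blaze import CSV, data, compute

t = data(CSV('train.csv'))
expr = t[(t.A == 4) & (t.B == 8) & (t.E == 10)]

# The expression is lazy; compute() evaluates it and returns the matching rows.
result = compute(expr)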
Using the readily available iris sample dataset as an example:
from blaze import data
from blaze.utils import example
iris = data(example('iris.csv'))
iris[(iris.sepal_length == 7) & (iris.petal_length > 2)]
    sepal_length  sepal_width  petal_length  petal_width          species
50             7          3.2           4.7          1.4  Iris-versicolor
The docs discuss parallel processing in Blaze.
Note that one can only parallelize over datasets that can be easily split in a non-serial fashion. In particular, one cannot parallelize computation over a single CSV file. Collections of CSV files and binary storage systems like HDF5 and BColz all support multiprocessing.
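So to parallelize your 74-million-row lookup, one option is to convert the CSV into one of those splittable stores first and then hand compute a pool's map. A rough sketch, assuming odo is installed and 'train.bcolz' is a hypothetical output path:

import multiprocessing

from blaze import data, compute
from odo import odo

# One-time conversion to bcolz, a chunked binary store that Blaze
# can split across worker processes (unlike a single CSV file).
odo('train.csv', 'train.bcolz')

t = data('train.bcolz')
expr = t[(t.A == 4) & (t.B == 8) & (t.E == 10)]

# Passing the pool's map asks Blaze to evaluate the chunks in parallel.
pool = multiprocessing.Pool(4)
result = compute(expr, map=pool.map)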
Timings on a single CSV file show that multiprocessing makes essentially no difference there:
import multiprocessing
from blaze import compute

pool = multiprocessing.Pool(4)

%timeit -n 1000 compute(iris[(iris.sepal_length > 7) & (iris.petal_length > 2)], map=pool.map)
1000 loops, best of 1: 12.1 ms per loop

%timeit -n 1000 compute(iris[(iris.sepal_length > 7) & (iris.petal_length > 2)])
1000 loops, best of 1: 11.7 ms per loop
Upvotes: 1